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Abstract 


RNA-Seq and gene expression microarrays provide comprehensive profiles of gene 
activity, but lack of reproducibility has hindered their application. A key challenge 
in the data analysis is the normalization of gene expression levels, which is currently 
performed following the implicit assumption that most genes are not differentially 
expressed. Here, we present a mathematical approach to normalization that makes 
no assumption of this sort. We have found that variation in gene expression is much 
larger than currently believed, and that it can be measured with available assays. 
Our results also explain, at least partially, the reproducibility problems encountered 
in transcriptomics studies. We expect that this improvement in detection will help 
efforts to realize the full potential of gene expression prohling, especially in analyses 
of cellular processes involving complex modulations of gene expression. 
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Introduction 


Since the discovery of DNA structure by Watson and Crick, molecular biology has pro¬ 
gressed increasingly fast, with rapid advances in sequencing and related genomic technolo¬ 
gies. Among these, DNA microarrays and RNA-Seq have been widely adopted to obtain 
gene expression prohles, by measuring the concentration of tens of thousands of mRNA 


molecules in single assays Schena et ah, 1995 Lockhart et ah, 1996 Duggan et ah, 1999 


Mortazavi et ah, 2008 Wang et ah , 2009 . Despite their enormous potential Golub et ah 


1999 

van ’t Veer et ah 

2002 

Ivanova et ah 

2002 

Ghi et ah 

2003 , problems of repro- 

ducibility and reliability 


Tan et ah 

2003 

Frantz, 

2005; 

Gouzinf 

2006 have discouraged 


their use in some areas, e.g. biomedicine Michiels et ahl 2005 Weigelt and Reis-Filho 


2010 Brettingham-Moore et ah, 2011 Boutros, 2015 


The normalization of gene expression, which is required to set a common reference level 


among samples Irizarry et ah, 2003 Tarca et ah, 2006 Garber et ah, 2011 Gonesa 


et ah, 2016 , has been reported to be problematic, affecting the reproducibility of results 


with both microarray Shi et ah, 2006; Shippy et ah, 2006 Draghici et ah, 2006 and 


RNA-Seq Bullard et ah, 2010 Dillies et ah, 2013 Su et ah, 2014| Lin et ah, 2016 


Batch effects and their influence on normalization have recently received a great deal of 


attention Leek et ah, 2010 Reese et ah, 2013; Li et ah, 2014 , resulting in approaches 


aiming to remove unwanted technical variation caused by differences between batches of 
samples or by other sources of expression heterogeneity (Listgarten et ah 2010 Gagnon- 


Bartsch and Speed, 2012 Risso et ah, 2014 . A different issue, however, is the underlying 


assumption made by the most widely used normalization methods to date, such as Median 
and Quantile normalization |Bolstad et ah| 2003 for microarrays, or RPKM (Reads Per 
Kilobase per Million mapped reads) {Mortazavi et ah 2008 , TMM (Trimmed Mean of M- 
values) (Robinson and Oshlack, 2010], and DESeq Anders and Huber ,p010| normalization 


for RNA-Seq, which posit that all or most genes are not differentially expressed van de 


Peppel et ah 2003 Hannah et ah , 2008; Loven et ah 2012 Dillies et ah 2013 ; [Hicks 
and Irizarry, 2015] . Although it may seem reasonable for many applications, this lack-of- 
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variation assumption has not been confirmed. Moreover, results obtained with external 
controls [van de Peppel et al. , 2003 Hannah et ah, 2005, 2008; Loven et al., 2012 or with 


RT-qPCR Shi et ah, 2006; Bullard et ah, 2010 suggest that it may not be valid. 


Some methods have been proposed to address this issue, based on the use of spike-ins 


van de Peppel et ah, 2003 Hannah et ah, 2008; Loven et ah, 2012 , negative control 


probes (SQN, Subset Quantile normalization) |Wu and Aryee 2010 , or negative control 
genes (RUV-2, Remove Unwanted Variation, 2-step) [Gagnon-Bartsch and Spe^ 2012 


These methods use external or internal controls that are known a priori not to be differen¬ 


tially expressed Lippa et al., 2010 . Their applicability, however, has been limited by this 


requirement of a priori knowledge, which is rarely available for a sufficiently large number 
of controls. In addition, other methods have been proposed to address the lack-of-variation 
assumption by identifying a subset of non-differentially expressed genes from the assay 
data, such as Cross-Correlation normalization [Chua et al.[ |2006| , LVS (Least-Variant 


Set) normalization Calza et al., [2008 , and NVAS (Nonparametric Variable Selection and 


Approximation) normalization Ni et ah, 2008 . While LVS normalization requires setting 


in advance a number for the fraction of genes to be considered as non-differentially ex¬ 


pressed, with values in the range 40-60% Calza et al., 2008 , Cross-Correlation and NVAS 


normalization are expected to degrade in performance when more than 50% of genes are 


differentially expressed Chua et ah, 2006, Ni et ah, 2008 . More recently, CrossNorm has 


been introduced Cheng et al., 2016 , based on the mixture of gene expression distribu¬ 


tions from the experimental conditions. This method, however, has been proposed for 
two experimental conditions, and specially for paired samples. The extension of this ap¬ 
proach to experimental designs with unpaired samples and more than a few experimental 
conditions would lead, as far as we can hypothesize, to an unmanageable size of the data 
matrix to process. 


Thus, to clarify and overcome the limitations imposed by the lack-of-variation assumption, 
we have developed an approach to normalization that does not assume lack-of-variation 
and that is suitable to most real-world applications. Hence, we aimed to avoid the need 
of spike-ins, a priori knowledge of control genes, or assumptions on the number of dif- 
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ferentially expressed genes. The analysis of several gene expression datasets using this 
approach conhrmed that our methods reached these goals. Furthermore, our results show 
that assuming lack-of-variation can severely undermine the detection of gene expression 
variation in real assays. We have found that large numbers of differentially expressed 
genes, with substantial expression changes, are missed or misidentihed when data are 
normalized with methods that assume lack-of-variation. 


Results 


E. crypticus and Synthetic Datasets 


A large gene expression dataset was obtained from biological triplicates of Enchytraeus 
crypticus (a globally distributed soil organism used in standard ecotoxicity tests), sampled 
under 51 experimental conditions (42 treatments and 9 controls), involving exposure to 
several substances, at several concentrations and durations according to a factorial design 
(Supplementary Table SI). Gene expression was measured using a customized high- 
density oligonucleotide microarray [Castro-Ferreira et ah 2014 , resulting in a dataset 
with 18,339 gene probes featuring good hybridization signal in all 153 samples. Taking 
into account the design of the microarray (Castro-Ferreira et ah 2014 , we refer to these 
gene probes as genes in what follows. 


To further explore and compare outcomes between normalization methods, two synthetic 
random datasets were built and analyzed. One of them was generated with identical means 
and variances gene-by-gene to the real E. crypticus dataset, and under the assumption 
that no gene was differentially expressed. In addition, normalization factors were applied, 
equal to those obtained from the real dataset. Thus, this synthetic dataset was similar 
to the real one, while complying by construction with the lack-of-variation assumption. 
The other synthetic dataset was also generated with comparable means and variances to 
the real dataset and with normalization factors, but in this case differential expression 
was added. Depending on the experimental condition, several numbers of differentially 
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expressed genes and ratios between over- and under-expressed genes were introduced 
(see Methods). Together, these synthetic datasets with and without differential gene 
expression represent, respectively, the alternative and null hypotheses for a statistical test 
of differential gene expression. 


Normalization Methods 


The gene expression datasets were normalized with four methods. Two of these methods 
are the most widely used procedures for microarrays, namely Median (or Scale) normal¬ 
ization and Quantile normalization Bolstad et al.| 2003| . (Note that current methods 
of normalization for RNA-Seq, such as RPKM [Mortazavi et ah [2008 , TMM Robin¬ 


son 


and Oshlack, 2010 , and DESeq [Anders and Huber 2010 , perform between-sample 


normalization by introducing a scaling per sample obtained with some form of mean or 
median, using all or a large set of genes. Thus their performance, in what concerns 
the issues addressed here, is expected to be similar to that of Median normalization for 
microarrays.) 


The other two normalization methods were developed for this study, they being called Me¬ 
dian Condition-Decomposition normalization and Standard- Vector Condition-Decomposition 
normalization, respectively MedianCD and SVCD normalization in what follows. 


With the exception of Quantile normalization, all used methods apply a multiplicative 
factor to the expression levels of each sample, equivalent to the addition of a number in the 
usual log 2 -scale for gene expression levels. Solving the normalization problem consists of 
finding these correction factors. This problem can be exactly and linearly decomposed into 
several sub-problems: one within-condition normalization for each experimental condition 
and one final between-condition normalization for the condition averages (see Methods). 
In the within-condition normalizations, the samples (replicates) subjected to each ex¬ 
perimental condition are normalized separately, whereas in the final between-condition 
normalization average levels for all conditions are normalized together. Because there 
are no genes with differential expression in any of the within-condition normalizations. 
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the lack-of-variation assumption only affects the final between-condition normalization. 
The assumption is avoided by using, in this normalization, expression levels only from 
no-variation genes, defined as genes that show no evidence of differential expression under 
a statistical test. An important detail is that the within-condition normalizations ensure 
good estimates of the within-condition variances, which are required by the statistical 
test for identifying no-variation genes. This requisite also implies that a minimum of two 
samples is required per experimental condition. Both methods of normalization proposed 
here, MedianCD and SVCD normalization, follow this condition-decomposition approach. 

With MedianCD normalization, all normalizations are performed with median values, as 
in conventional Median normalization, but only no-variation genes are employed in the 
between-condition step. Otherwise, if all genes were used in this final step, the resulting 
total normalization factors would be exactly the same as those obtained with conventional 
Median normalization. 


For SVCD normalization, a vectorial procedure was developed to carry out each normal¬ 
ization step, called Standard-Vector normalization. The samples of any experimental con¬ 
dition, in a properly normalized dataset, must be exchangeable. In mathematical terms, 
the expression levels of each gene can be considered as an s-dimensional vector, where s is 
the number of samples for the experimental condition. After standardization (mean sub¬ 
traction and variance scaling), these standard vectors are located in a (s — 2)-dimensional 
hypersphere. The exchangeability mentioned above implies that, when properly normal¬ 
ized, the distribution of standard vectors must be invariant with respect to permutations 
of the samples and must have zero expected value. These properties allow to obtain a 
robust estimator of the normalization factors, under fairly general assumptions that do 
not imply any particular distribution of gene expression (see Methods). 


It is worth mentioning that the limit case when the number of samples is two (s = 2) 
represents a degenerate case for Standard-Vector normalization, in which the space of 
standard vectors reduces to a 0-dimensional space with only two points. In this degenerate 


case. Standard-Vector normalization is equivalent to global Loess normalization Yang 
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et al., 2002 Smyth and Speed, 2003 , i.e. Loess normalization withont correction for non¬ 


linearities with respect to the level of gene expression or microarray print-tips. In this 
sense, Standard-Vector normalization is a generalization to any nnmber of samples of the 
approach underlying the different types of Loess normalization. 


Normalization Results 

Figure [T] displays the results of applying the four normalization methods to the real and 
two synthetic datasets. Each panel shows the interquartile range of expression levels for 
the 153 samples, grouped in triplicates exposed to each experimental condition. Both 
Median and Quantile normalization (second and third rows) yielded similar outputs for 
the three datasets. In contrast, MedianCD and SVCD normalization (fourth and hfth 
rows) detected much greater variation between conditions in the real dataset and the 
synthetic dataset with differential gene expression. Conventional Median normalization 
makes, by design, the median of all samples to be the same, while Quantile normalization 
makes the full distribution of gene expression of all samples to be the same. Hence, if 
there were differences in medians or distributions between experimental conditions, both 
methods would have removed them. Such variation was indeed present in the synthetic 
dataset with differential gene expression (Fig. [^,n), and hence we can hypothesize the 
same for the real dataset (Fig. [^,m). 

Influence of No-Variation Genes on Normalization 

To clarify how MedianCD and SVCD normalization preserved the variation between con¬ 
ditions, we studied the influence of the choice of no-variation genes in the hnal between- 
condition normalization. To this end, we obtained the between-condition variation as 
a function of the number of no-variation genes, in two families of cases. In one family, 
no-variation genes were chosen in decreasing order of p-values from the statistical test 
used to analyze variation between conditions. In the other family, genes were chosen at 









random. The first option was similar to the approach implemented to obtain the resnlts 
presented in Fig. [I]-o, with the difference that, there, the no-variation genes were chosen 
antomatically, by a snbseqnent statistical test performed on the distribntion of p-valnes 
(see Methods). 

For the real dataset (Fig. [^), the random choice of genes resnlted in decays (n 

being the nnmber of chosen genes), followed by a platean. The decays reflect the 

error in the estimation of normalization factors. Selecting the genes by decreasing p- 
valnes, however, yielded a completely different resnlt. Up to a certain nnmber of genes, 
the variance remained similar, bnt for larger nnmbers of genes the variance dropped 
rapidly. Fignre shows, therefore, that between-condition variation was removed as 
soon as the between-condition normalizations nsed genes that changed in expression level 
across experimental conditions. The big circles in Fig. indicate the working points 
of the normalizations nsed for the resnlts displayed in Fig. [2,ni. In fact, these points 
slightly nnderestimated the variation between conditions. Althongh the statistical test 
for identifying no-variation genes ensnred that no evidence of variation was fonnd, the 
expression of some selected genes varied across conditions. 

The resnlts obtained for the synthetic dataset with differential gene expression (Fig. |^) 
were qnalitatively similar to those of the real dataset, bnt with two important differences. 
The amonnt of between-condition variation detected (by selecting no-variation genes by 
decreasing p-valnes) was smaller than with the real dataset, implying that the real dataset 
had larger differential gene expression. Additionally, the variation detected in the syn¬ 
thetic dataset had a simpler dependency on the nnmber of genes, an indication that the 
differential gene expression introdnced in the synthetic dataset had a simpler structure 
than that of the real dataset. 

Figure]^ shows the results for the synthetic dataset without differential gene expression. 
There were no plateaus when no-variation genes were chosen randomly, only decays, 
and differences were small when no-variation genes were selected by decreasing p-values. 
Big circles show that the working points of Fig. [T|,o were selected with all available genes 
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as no-variation genes, which is the optimum choice when there is no differential gene 
expression. 

Overall, Fig. shows that the between-condition variation displayed in Fig. [2,k,m,n is 
not an artifact caused by using an exceedingly small or extremely particular set of genes 
in the final between-condition normalization, but that this variation originated from the 
datasets. The positions of the big circles in Fig. [^highlight the good performance of the 
statistical approach for choosing no-variation genes in the normalizations carried out for 
Fig. 0-0. Besides, the residual variation displayed by the n decays implies that, as 
estimators of the normalization factors, SVCD normalization features smaller error than 
MedianCD normalization. 

Differential Gene Expression 

In what follows, we call detected positives the differentially expressed genes (DEGs) result¬ 
ing from the statistical analyses, treatment positives the DEGs introduced in the synthetic 
dataset with differential gene expression, true positives the detected positives which were 
also treatment positives, and false positives the detected positives which were not treat¬ 
ment positives. Gorresponding terms for negatives refer to genes which were not DEGs. 

Figure l^shows the numbers of DEGs detected in the real and synthetic datasets, for each 
of the 42 experimental treatments compared to the corresponding control (Supplementary 
Table S2), after normalizing with the four methods. For the real dataset (Fig. |^), the 
number of DEGs identified after MedianGD and SVGD normalization were much larger 
for most treatments, in some cases by more than one order of magnitude. For the synthetic 
dataset with differential gene expression (Fig. [^), results were qualitatively similar, but 
with less differential gene expression detected, consistently with Fig. [^,b. The number of 
treatment positives can be displayed in this case (empty black down triangles. Fig. [^), 
showing a better correlation, with MedianGD and SVGD normalization, between the 
number of treatment positives and detected positives. For the synthetic dataset without 
differential gene expression (Fig. |^), no DEG was found but for one or two DEGs in 
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two conditions. Given that the false discovery rate was controlled to be less than 5% per 
treatment, this is expected to happen when evaluating 42 treatments. 


Figure reports, for the real dataset, statistically signihcant changes of gene expression, 
that is, changes that cannot be explained by chance. Equally important is the effect size, 
i.e. the scale of detected variation in DEGs, which is displayed by Eig. The boxplots 
show absolute fold changes of expression level for all DEGs detected after applying each 
normalization method. MedianGD and SVGD normalization allowed to detect smaller 
changes of gene expression, which were otherwise missed when using Median and Quantile 
normalization. This differential gene expression detected with MedianGD and SVGD 
normalization can hardly be considered negligible, given that, for all treatments, the 
interquartile range of absolute fold changes was above 1.5-fold, and, for more than 28 
(67%) treatments, the median absolute fold change was greater than 2-fold. Interestingly, 
the scale of differential gene expression detected with MedianGD and SVGD normalization 
in this assay is of similar magnitude to those reported by studies of global mRNA changes 


using external controls with microarrays and/or RNA-Seq van de Peppel et al., 2003 


Loven et al., 2012 


Figure [^displays the balance of differential gene expression, i.e. the comparison between 
the number of over- and under-expressed genes, for the real dataset. The quantity in the 
y-axes is the mean of an indicator variable B, which assigns -|-1 to each over-expressed 
DEG and —1 to each under-expressed DEG. Hence, balance of differential gene expres¬ 
sion corresponds to R = 0, all DEGs over-expressed to R = -|-1, and, for example, 60% 
DEGs under-expressed to R = —0.2. As discussed below, and as it has been reported 


before Irizarry et ah, 2006 Galza et ah, 2008 Ni et al., 2008; Zhu et al., 2010 , the 


balance of differential gene expression has a strong impact on the performance of normal¬ 
ization methods. Figure shows that, regardless of the normalization method used, the 
unbalance of differential gene expression detected in the real dataset was substantial for 
most conditions. Detected unbalances were (in absolute value) larger with MedianGD and 
SVGD normalization, in both cases with more than 30 (71%) treatments having |R| > 0.5, 
that is, more than 75% of over- or under-expressed genes. Moreover, the differences be¬ 
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tween the unbalances detected with Median and Quantile normalization, on one hand, 
and MedianCD and SVCD normalization, on the other, were specially notorious for the 
treatments with more DEGs (treatments 26-42, Fig. |^). In those cases. Median and 
Quantile normalization resulted in the smallest detected unbalances, whereas MedianCD 
and SVCD normalization yielded the largest ones, with values near B = ±1 for all but 
two treatments. 

True differential expression was known, by construction, for the synthetic dataset with 
differential gene expression. Thus, Fig. shows for this dataset the true positive rate 
(ratio between true positives and treatment positives, also known as statistical power or 
sensitivity) and the false discovery rate (FDR, ratio between false positives and detected 
positives). With conditions 1 to 20, which correspond to those conditions with less than 
approximately 10% of treatment positives (Fig. empty black down triangles), the 
true positive rate was similarly low for all normalizations. Regarding the FDR, when 
the (total) number of detected positives was up to a few tens, variability of the FDR 
around the target bound at 0.05 is to be expected, given that the bound is dehned over an 
average of repetitions of the multiple-hypothesis test. Yet, the FDR obtained after Median 
and Quantile normalization was higher than the 0.05 bound for most conditions. More 
striking, however, was the behavior for conditions 21 to 42 (more than 10% of treatment 
positives). The true positive rates obtained after Median and Quantile normalization 
were much lower than those obtained with MedianCD and SVCD normalization, while 
the FDR after Median and Quantile normalization was clearly over the bound at 0.05. 
In comparison, MedianCD and SVCD normalization, besides offering better sensitivity of 
differential gene expression, maintained the FDR consistently below the desired bound. 

Figure further explores these results, by representing the true positive rate and false 
discovery rate (FDR) as a function of the unbalance between over- and under-expressed 
genes. Figure shows that the unbalance of differential gene expression was a key factor 
in the results obtained with Median and Quantile normalization. When most DEGs were 
over- or under-expressed, both the true positive rate and FDR degraded markedly after 
using Median or Quantile normalization. In contrast, the true positive rate and FDR 
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were not affected by the unbalance of differential gene expression when using MedianCD 
or SVCD normalization. 

Concerning the identihcation of no-variation genes, both MedianCD and SVCD normal¬ 
ization performed well. In the synthetic dataset without differential gene expression, both 
methods identihed all genes as no-variation genes, which is the best possible result. In the 
synthetic dataset with differential gene expression, 1,834 genes (10% of a total of 18,339 
genes) were, by construction, negatives across all treatments. MedianCD and SVCD nor¬ 
malization detected, respectively, 1,723 and 1,827 no-variation genes, among which 96.9% 
and 95.2% were true negatives. 


Analysis of the Golden Spike and Platinum Spike Datasets 

To provide additional evidence of the performance of MedianCD and SVCD normalization. 


we analyzed the Golden Spike Choe et ah, 2005 and Platinum Spike Zhu et al., 2010 


datasets. Both of them are artihcial real datasets, the largest ones for which true DEGs 
are known. Hence, they have been widely used to benchmark normalization methods 


2007; 

Pearson 

2008 

Galza et al.| 

2008 

Ni et al., 

2008 


Zhu et al., 2010 Cheng et ah, 2016 


The design of the Golden Spike dataset was questioned for reasons concerning, among 
others, the anomalous null distribution of p-values, the lack of biological replicates, and 


the high concentration of spike-ins Dabney and Storey, 2006 Irizarry et ah, 2006 Gaile 


and Miecznikowski, 2007 . Nevertheless, this dataset is worth considering here because 


it challenges what we claim are key capabilities of our approach, that is, to correctly 
normalize gene expression data when many genes are differentially expressed, even with 
large unbalance between over- and under-expression. This dataset consists of microarray 
data obtained with the Affymetrix GeneGhip DrosGenomel, with two experimental con¬ 
ditions and three technical replicates per condition. Excluding Affymetrix internal control 
probes, the dataset contains a total of 13,966 gene probe sets, of which 3,876 were spiked- 
in, which we call known in what follows. Among these, 1,328 (34.3%) were over-expressed 
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(known positives) to varying degrees between 1.1- and 4-fold, while the remaining 2,535 
(65.4%) were spiked-in at the same concentration in both conditions (known negatives). 
(Percentages do not add up to 100% because of a very small number of probe sets with 
weak matching to multiple clones Choe et al., 2005|.) 


In addition to the normalization methods used above, we included Cyclic Loess normal¬ 


ization Yang et al., 2002 Ballman et al., 2004 in this case, because it facilitates a better 


comparison of results with previous studies Choe et al., 2005 Schuster et al., 2007t |Pear-| 
2008 Calza et al., 2008 Ni et ah, 2008 Zhu et ah, 2010 Cheng et ah, 2016| . Figure]^ 


son 


summarizes the results obtained for the Golden Spike dataset, by displaying Receiver Op¬ 
erating Characteristic (ROC) curves for the detection of differential gene expression. The 
upper panel shows the true positive rate (as before, ratio between true positives and 
treatment positives) versus the false positive rate (ratio between false positives and treat¬ 
ment negatives), while the lower panel shows the number of true positives versus the 
number of false positives. In both cases, detected and treatment positives/negatives were 
restricted to known genes, following previous studies [Gaile and MiecznikowsId| 2007 


Schuster et ah, 2007 Pearson| 2008 . Doing otherwise would have given an excessively 
dominant role to the issue of cross-hybridization in the analysis of differential gene expres¬ 
sion [Schuster et ah , 2007j . Additionally, the analysis was performed using only probe sets 
with hybridization signal in all samples, with the aim of factoring out differences between 
normalization methods caused by the response to missing data. Results obtained without 


this restriction (Supplementary Fig. S3) or with f-tests instead of limma Ritchie et al 


2015 analysis (Supplementary Fig. S4) were very similar to those of Fig. 


The comparison of ROG curves shown in Fig. highlight the superior performance of 
MedianGD and, in particular, SVGD normalization. Dashed lines show results when 
the list of known negatives was given as an input to some of the normalization methods 
(something than cannot be done in real assays). It is remarkable that SVGD normalization 
featured equally well with or without this information. 


Points in Fig. [^indicate the results when controlling the false discovery rate (FDR) to be 
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below 0.01 (left point on each curve) or 0.05 (right point). Figure]^ shows reference lines 
for actual FDR equal to 0.01, 0.05, 0.1, 0.2 and 0.5 (from left to right). In all cases, the 
FDR was not adequately controlled, although the difference between intended and actual 
FDR was notably smaller with MedianCD and SVCD normalization. Lack of control of 


the FDR in the analysis of this dataset has been previously reported Choe et ah, 2005 


Pearson, 2008 . It is caused by the non-uniform (hence anomalous) distribution of p-values 


for negative genes, which results from the analysis of differential gene expression Dabney 


and Storey, 2006 Gaile and Miecznikowski, 2007 Fodor et ah, 2007; Pearson| 2008 . It has 


been argued that this anomalous distribution of p-values is, in turn, a consequence of the 
own experimental design of the dataset, in particular the lack of biological replication and 
the way clone aliquots were mixed to produce each gene group with a given fold change 


Dabney and Storey, 2006 . Later studies have attributed this issue mostly to non-linear 


or intensity-dependent effects, not properly corrected in the within-sample normalization 
step (e.g. background correction) of the analysis pipeline [Gaile and Miecznikowski 2007 
Fodor et ah, [2007 ; Pearson, 2008 Zhu et ah, 2010 


Goncerning the identihcation of no-variation genes, both MedianGD and SVGD normal¬ 
ization worked correctly. MedianGD normalization identihed 561 no-variation genes, of 
which 93.9% were known, and among which 84.1% were known negatives. SVGD normal¬ 
ization, in comparison, featured better detection, with 1,224 no-variation genes identihed, 
of which 94.4% were known, and among which 90.0% were known negatives. 


The design of the Platinum Spike dataset Zhu et ah, 2010 took into account the concerns 
raised by the Golden Spike dataset, offering a dataset with two experimental conditions 
and nine (three biological x three technical) replicates per condition, and including near 
50% more spike-ins. Besides, differential gene expression was balanced, with respect 
to both total mRNA amount and extent of over- and under-expression. Gene expres¬ 
sion data was obtained with Affymetrix Drosophila Genome 2.0 microarrays. Excluding 
Affymetrix internal control probes, the dataset contained a total of 18,769 probe sets, of 
which 5,587 were spiked-in, called known as above. Among these, 1,940 (34.7%) were dif¬ 
ferentially expressed (known positives) to varying degrees between 1.2- and 4-fold (1,057 


15 






















































over-expressed, 883 under-expressed), while the remaining 3,406 (61.0%) were spiked-in 
at the same concentration in both conditions (known negatives). 


Figure shows ROC curves for the Platinum Spike dataset. As above, only known genes 
were considered for detected and treatment positives/negatives. Additionally, gene probes 
were restricted to those with signal in all samples. Results obtained without this restric¬ 
tion (Supplementary Fig. S5) or with t-tests instead of limma analysis (Supplementary 
Fig. S6) were again very similar. In contrast to the Golden Spike dataset (Fig. [^, the 
performance concerning true and false positives resulting from the different normalization 
methods was much more comparable. In this case, MedianCD and SVCD normalization 
were only marginally better. Note, however, that the FDR was again not properly con¬ 
trolled (Fig. 1^). Similarly to the Golden Spike dataset, and despite biological replication 
and a different experimental setup, obtained distributions of p-values for negative genes 


have been reported to be non-uniform Zhu et ah, 2010 . This fact is consistent with 
previous arguments relating the lack of control of the FDR to a general problem concern¬ 


ing the correction of non-linearities in the preprocessing of microarray data Gaile and 


Miecznikowski, 2007 Fodor et al., 2007 Pearson, 2008; Zhu et ah, 2010 


Regarding the identihcation of no-variation genes, MedianGD and SVGD normalization 
also worked correctly with this dataset. MedianGD normalization identihed 2,090 no¬ 
variation genes, of which 95.4% were known, and among which 98.7% were known nega¬ 
tives. SVGD normalization featured slightly better, with 2,232 no-variation genes identi¬ 
hed, of which 95.3% were known, and among which 98.3% were known negatives. 


Discussion 

The lack-of-variation assumption underlying current methods of normalization was self- 
fulhlling, removing variation in gene expression that was actually present. Moreover, it 
had negative consequences for downstream analyses, as it removed potentially important 
biological information and introduced errors in the detection of gene expression. The 
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resulting decrease in statistical power or sensitivity is a handicap, which can be addressed 
by increasing the number of samples per experimental condition. However, degradation 
of the (already weak) control of the false discovery rate when using Median or Quantile 
normalization is a major issue for real-world applications. 

The removal of variation can be understood as additive errors in the estimation of nor¬ 
malization factors. Considering data and errors vectorially (see Methods), the length of 
each vector equals, after centering and up to a constant factor, the standard deviation of 
the data or error. Errors of small magnitude, compared to the data variance, would only 
have minor effects. However, errors of similar or greater magnitude than the data variance 
may, depending on the vector lengths and the angle between the vectors, severely distort 
the observed data variance. This will, in turn, cause spurious results in the statistical 
analyses. Furthermore, the angles between the data and the correct normalization factors 
(considered as vectors) are random, given that expression data reflect biological variation 
while normalization factors respond to technical variation. If the assay is repeated, even 
with exactly the same experimental setup, the errors in the normalization factors will vary 
randomly, causing random spurious results in the downstream analyses. This explains, at 
least partially, the lack of reproducibility found in transcriptomics studies, especially for 
the detection of changes in gene expression of small-to-medium magnitude (up to 2-fold), 
because variation of this size is more likely to be distorted by errors in the estimation 
of normalization factors. Accordingly, the largest differences in numbers of differentially 
expressed genes detected by Median and Quantile normalization, compared to MedianCD 
and SVCD normalization, occurred in the treatments with the smallest magnitudes of 
gene expression changes (Figs. 

The variation between medians displayed in Fig. [2,k,m,n may seem surprising, given 
routine expectations based on current methods (Fig. ,g,h). Nevertheless, this variation 
inevitably results from the unbalance between over- and under-expressed genes. As an 
illustration of this issue, let us consider a case with two experimental conditions, in which 
the average expression of a given gene is less than the distribution median under one 
condition, but greater than the median under the other. The variation of this gene alone 
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will change the value of the median to the expression level of the next ranked gene. 
Therefore, if the number of over-expressed genes is different from the number of under¬ 
expressed genes, and enough changes cross the median boundary, then the median will 
substantially differ between conditions. Only when differential expression is negligible or 
is balanced with the respect to the median, will the median stay the same. Note that this 
is a related but different requirement from the number of over- and under-expressed genes 
being the same. This argument applies equally to any other quantile in the distribution 
of gene expression. The case of Quantile normalization is the least favorable, because 
it requires that changes of gene expression are balanced with respect to all distribution 
quantiles. 


Compared with other normalization approaches that try to identify no-variation genes 


from expression data, such as Cross-Correlation Chua et al., 2006 , TVS Calza et ah 


2008 , or NVAS Ni et ah , 2008 normalization, our proposal is able to work correctly with 


higher degrees of variation in gene expression, given that those methods are not expected 
to work correctly when more than 50-60% of genes vary. The reason for this difference in 
performance lies in that those methods use a binning strategy over the average expression 
between conditions (Cross-Correlation, NVAS), or need to assume an a priori fraction 
(usually 40-60%) of non-differentially expressed genes (TVS). When the majority of genes 
are differentially expressed, very few of those bins may be suitable for normalization, or 
the assumed fraction of non-differentially expressed genes may not hold. In contrast, our 
approach makes one single search in a space of p-values, and without assuming any fraction 
of non-differentially expressed genes. As long as there are a sufficient number of non- 
differentially expressed genes, of the order of several hundreds, normalization is possible, 
including cases with global mRNA changes or transcriptional amplihcation (van de Peppel 


et ah 

2003 

|Hannah et ah 

2005 


of comparison between the magnitude of the error in the estimation of normalization 
factors and the amount of biological variation. The estimation error decreases with the 
number of no-variation genes detected (Fig. |^, and whenever normalization error is well 
below biological variation, normalization between samples will be correct and benehcial 
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for downstream analyses. 


Our approach to normalization is based in four key ideas: first, decomposing the nor¬ 
malization by experimental conditions and normalizing separately each condition before 
normalizing the condition means; second, using the novel Standard-Vector normaliza¬ 
tion (or alternatively median scaling) to perform each normalization; third, identifying 
no-variation genes from the distribution of p-values resulting from a statistical test of 
variation between conditions; and fourth, employing only no-variation genes for the hnal 
between-condition normalization. These four ideas are grounded on rigorous mathemat¬ 
ical statistics (see Methods and Supplementary Information). It is also worth noting 
that both Median and Standard-Vector normalization, as methods for each normalization 
step, are distribution-free methods; they do not assume Gaussianity or any other kind 
of probability distribution for the expression levels of genes. MedianCD and SVCD nor¬ 
malization are freely available in the R package cdnormbio, installable from GitHub (see 
Gode Availability). 

Previous assumptions that gene variation is rather limited could suggest that there is no 
need for more comprehensive normalization methods such as our proposal. In line with 
this, it could be argued that the amount of variation in our real {E. crypticus) dataset 
is exceptional and much larger than the variation likely to be occur in most experiments. 
We think that this an invalid belief. Most of the available evidence concerning widespread 
variation in gene expression is inadequate, because it involves circular reasoning. We have 
shown here that current normalization methods, used by almost all studies to date, as¬ 
sume no variation in gene expression between experimental conditions, and they remove 
it if it exists, unless it is balanced. Therefore, these methods cannot be used to discern 
the extent and balance of global variation in gene expression. Only methods that are 
able to normalize correctly, whatever these extent and balance are, can be trusted for this 
task. The fact that our methods perform well with large and unbalanced differential gene 
expression does not imply that they perform poorly when differential gene expression is 
more moderate or balanced. Our results show that this is not case. In the design of our 
methods, no compromise was made to achieve good performance with high variation in 
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exchange for not so good performance with low variation. The downside of onr approach 
lies elsewhere, in a greater algorithmic complexity and a greater demand of compnting re- 
sonrces. Yet, we consider this a minor demand, given the capabilities of today’s compnters 
and the resonrces reqnired by cnrrent high-thronghpnt assays. 


Onr resnlts have being obtained from microarray data, but similar effects are expected 
to be found in RNA-Seq assays. Current normalization procedures for RNA-Seq, such 
as RPKM [Mortazavi et ah 2008 , TMM Robinson and Oshlack, 2010 , or DESeq An¬ 


ders and Huber, 2010 , perform between-sample normalization based on some form of 


global scaling and under the assumption that most genes are not differentially expressed. 
This makes RPKM, TMM, and DESeq normalization, in what concerns between-sample 
normalization and the removal or distortion of variation discussed here, similar to conven¬ 
tional Median normalization. An example of this issue, including results from microarray 
and RNA-Seq assays, has been reported in a study of the transcriptional amplihcation 


mediated by the oncogene c-Myc Loven et ah, 2012 


Importantly, MedianCD and SVCD normalization were designed with no dependencies 
on any particular aspect of the technology used to globally measure gene expression, i.e. 
microarrays or RNA-Seq. The numbers in the input data are interpreted as steady state 
concentrations of mRNA molecules, in order to identify the normalization factors, and 
irrespectively of whether the concentrations were obtained from fluorescence intensities 
of hybridized cDNA (microarrays) or from counts of fragments of cDNA (RNA-Seq). 
Both technologies require between-sample normalization, because in some step of the 
assay the total mRNA or cDNA mass in each sample must be equalized within a given 
range required by the experimental platform. This equalization of total mass, together 
with other sources of variation in the total efficiency of the assay, amounts to a factor 
multiplying the concentration of each mRNA species. This factor is different for each 
sample, and it is what between-sample normalization aims to detect and correct for. 
Moreover, the total mRNA mass in each sample is, in many cases, mostly determined 
by a few highly expressed genes, rather than an unbiased average over the total mRNA 
population. This makes between-sample normalization critical regarding comparisons of 
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gene expression between different experimental conditions, as our results illustrate. It is 
also important to highlight that this between-sample uncertainty in the measurement of 
mRNA concentrations is different from other issues, such as for example non-linearities. 
These other problems are usually more specific to each technology, and they are the 
scope of within-sample normalization (e.g. background correction for microarrays and 
gene-length normalization for RNA-Seq), which are obviously also necessary and should 
be applied before between-sample normalization. Similarly, methods that address the 
influence of biological or technical confounding factors on downstream analyses, such 
as SVA [Leek and Storey 2007 or PEER fStegle et al. , 2010 , should be applied when 
necessary, after normalizing. 

Finally, the significance of widespread variation in gene expression merits consideration 
from the viewpoint of molecular and cell biology. Established understanding about the 
regulation of gene expression considers it as a set of processes that generally switch on 
or off the expression of genes, performed mostly at transcription initiation, by the com¬ 
binatorial regulation of a large number of transcription factors, and with an emphasis on 
gene expression programs associated with cell differentiation and development. Recent 
studies, however, have expanded this understanding, offering a more complex perspective 
on the regulation of gene expression, by identifying other rate-limiting regulation points 
between transcription initiation and protein translation, such as transcription elongation 
and termination, as well as mRNA processing, transport and degradation. Promoter- 
proximal pausing of RNA polymerase II (in eukaryotes) [Core et 2008; Adelman and 


Lis, 2012 and transcript elongation Jonkers and Lis, 2015 , in particular, have received 


a great deal of attention recently, in connection with gene products involved in signal 
transduction pathways. These mechanisms, which seem to be highly conserved among 
metazoans, would allow cells to tune the expression of activated genes in response to sig¬ 
nals concerning, for example, homeostasis, environmental stress or immune response. As 
an illustration, studies about the transcription amplification mediated by the oncogene c- 
Myc have uncovered that it regulates the promoter-proximal pausing of RNA polymerase 
II, affecting a large number of genes already activated by other regulatory mechanisms 
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Lin et al., 2012 Nie et al., 2012; Littlewood et al., 2012 . Our results for the toxicity 


experiment with E. crypticus are consistent with regulatory capabilities for broad hne- 
tuning of gene expression levels, far beyond what conventional methods of normalization 
would allow to detect. This contrast underlines that normalization methods that truly 
preserve variation between experimental conditions are necessary for high-throughput as¬ 
says exploring genome-wide regulation of gene expression, as required by current research 
in molecular and cell biology. 


In summary, this study proves that large numbers of genes can change in expression level 
across experimental conditions, and too extensively to ignore in the normalization of gene 
expression data. Current normalization methods for gene expression microarrays and 
RNA-Seq, because of a lack-of-variation assumption, likely remove and distort variation 
in gene expression. The normalization methods proposed here solve this problem, offering 
a means to investigate broad changes in gene expression that have remained hidden to 
date. We expect this to provide revealing insights about diverse biomolecular processes, 
particularly those involving substantial numbers of genes, and to assist efforts to realize 
the full potential of gene expression prohling. 


Methods 


Test Organism and Exposure Media 


The test species was Enchytraeus crypticus. Individuals were cultured in Petri dishes 


containing agar medium, in controlled conditions Gomes et al., 2015b 


For copper (Cu) exposure, a natural soil collected at Hygum, Jutland, Denmark was 


used Gomes et al., 2015b, Scott-Fordsmand et al., [2000 . For silver (Ag) and nickel (Ni) 
exposure, the natural standard soil LUFA 2.2 (LUFA Speyer, Germany) was used [Gomes 


et ah, 2015b . The exposure to ultra-violet (UV) radiation was done in ISO reconstituted 


water OEGD, 2004a 


22 































Test Chemicals 


The tested Cu forms [Gomes et al. 2015b| included copper nitrate (Cu(N 03 ) 2 - 3 H 20 > 
99%, Sigma Aldrich), Cu nanoparticles (Cu-NPs, 20-30 nm, American Elements) and Cu 
nanowires (Cu-Nwires, as synthesized [Chang et ah 2005[). 


The tested Ag forms Gomes et al., 2015b[ included silver nitratre (AgNOa >99%, Sigma 
Aldrich), non-coated Ag nanoparticles (Ag-NPs Non-Coated, 20-30 nm, American Ele¬ 
ments), Polyvinylpyrrolidone (PVP)-coated Ag nanoparticles (Ag-NPs PVP-Coated, 20- 
30 nm, American Elements), and Ag NM300K nanoparticles (Ag NM300K, 15 nm, JRC 
Repository). The Ag NM300K was dispersed in 4% Polyoxyethylene Glycerol Triolaete 
and Polyoxyethylene (20) orbitan mono-Laurat (Tween 20), thus the dispersant was tested 
alone as control (CTdisp). 


The tested Ni forms included nickel nitrate (Ni(N03)2-61120 > 98.5%, Fluka) and Ni 
nanoparticles (Ni-NPs, 20 nm, American Elements). 


Spiking Procedure 


Spiking for the Cu and Ag materials was done as previously described Gomes et al. 


2015b . For the Ni materials, the Ni-NPs were added to the soil as powder, following 


the same procedure as for the Cu materials. NiNOs, being soluble, was added to the 
pre-moistened soil as aqueous dispersions. 


The concentrations tested were selected based on the reproduction effect concentrations 
EC20 and EC50, for E. crypticus, within 95% of conhdence intervals, being: CuNOs 
EC20/50 = 290/360 mgCu/kg, Cu-NPs EC20/50 = 980/1760 mgCu/kg, Cu-Nwires EC20/50 
= 850/1610 mgCu/kg, Cu-Field EC20/50 = 500/1400 mgCu/kg, AgNOs EC20/50 = 45/60 
mgAg/kg, Ag-NP PVP-coated EC20/50 = 380/550 mgAg/kg, Ag-NP Non-coated EC20/50 
= 380/430 mgAg/kg, Ag NM300K EC20/50 = 60/170 mgAg/kg, CTdisp = 4% w/w Tween 
20, NiNOs EC20/50 = 40/60 mgNi/kg, Ni-NPs EC20/50 = 980/1760 mgNi/kg. 
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Four biological replicates were performed per test condition, including controls. For Cu 
exposure, the control condition for all the treatments consisted of soil from a control area 
at Hygum site, which has a Cu background concentration of 15 mg/kg |Scott-Fordsmand 
et ah, 2000| . For Ag exposure, two control sets were performed: CT (un-spiked LUFA soil, 
to be the control condition for AgNOa, Ag-NPs PVP-Coated and Ag-NPs Non-Coated 
treatments) and CTdisp (LUFA soil spiked with the dispersant Tween 20, to be the 
control condition for the Ag NM300K treatments). For Ni exposure, the control consisted 
of un-spiked LUFA soil. 


Exposure Details 


In soil (i.e. for Cu, Ag and Ni) exposure followed the standard ERT OECD, 2004b 


with adaptations as follows: twenty adults with well-developed clitellum were introduced 
in each test vessel, containing 20 g of moist soil (control or spiked). The organisms 
were exposed for three and seven days under controlled conditions of photoperiod (16:8 
h hght:dark) and temperature 20 ± 1 °C without food. After the exposure period, the 
organisms were carefully removed from the soil, rinsed in deionized water and frozen in 
liquid nitrogen. The samples were stored at —80°C, until analysis. 


For UV exposure, the test conditions OECD, 2004a were adapted for E. crypticus Gomes 


et ah, 2015a . The exposure was performed in 24-well plates, where each well corresponded 


to a replicate and contained 1 ml of ISO water and hve adult organisms with clitellum. 
The test duration was hve days, at 20±1 °C. The organisms were exposed to UV on a daily 
basis, during 15 minutes per day to two UV intensities (280-400nm) of 1669.25 ± 50.83 
and 1804.08 ± 43.10 mW/m^, corresponding to total UV doses of 7511.6 and 8118.35 
J/m^, respectively. The remaining time was spent under standard laboratory illumination 
(16:8 h photoperiod). UV radiation was provided by an UV lamp (Spectroline XX15F/B, 
Spectronics Corporation, NY, USA, peak emission at 312 nm) and a cellulose acetate sheet 
was coupled to the lamp to cut-off UVC-range wavelengths |Gomes et al.||2015a| . Thirty 
two replicates per test condition (including control without UV radiation) were performed 
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to obtain 4 biological replicates for RNA extraction, each one with 40 organisms. After 
the exposnre period, the organisms were carefnlly removed from the water and frozen in 
liqnid nitrogen. The samples were stored at —80°C, nntil analysis. 


RNA Extraction, Labeling and Hybridization 


RNA was extracted from each replicate, which contained a pool of 20 and 40 organisms, 
for soil and water exposnre, respectively. Three biological replicates per test treatment 
(inclnding controls) were used. Total RNA was extracted using SV Total RNA Isolation 
System (Promega). The quantity and purity were measured spectrophotometrically with a 
nanodrop (NanoDrop ND-1000 Spectrophotometer) and its quality checked by denaturing 
formaldehyde agarose gel electrophoresis. 


500 ng of total RNA were amplihed and labeled with Agilent Low Input Quick Amp 
Labeling Kit (Agilent Technologies, Palo Alto, CA, USA). Positive controls were added 
with the Agilent one-color RNA Spike-In Kit. Purihcation of the amplihed and labeled 
cRNA was performed with RNeasy columns (Qiagen, Valencia, CA, USA). 


The cRNA samples were hybridized on custom Gene Expression Agilent Microarrays (4 
X 44k format), with a single-color design [Castro- Ferreira et ah 2014] . Hybridizations 
were performed using the Agilent Gene Expression Hybridization Kit and each biological 
replicate was individually hybridized on one array. The arrays were hybridized at 65 °C 
with a rotation of 10 rpm, during 17 h. Afterwards, microarrays were washed using 
Agilent Gene Expression Wash Buffer Kit and scanned with the Agilent DNA microarray 
scanner G2505B. 


Data Acquisition 

Fluorescence intensity data was obtained with Agilent Feature Extraction Software v. 10.7.3.1, 
using recommended protocol GEl_107_Sep09. Quality control was done by inspecting the 
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reports on the Agilent Spike-in control probes. 


Data Analysis 


Analyses were performed with R R Core Team, 2016 v. 3.3.1, using R packages plotrix 


Lemon, 2006 v. 3.6.3 and RColorBrewer Neuwirth, 2014 v. 1.1.2, and with Bioconductor 


Huber et al.| 2015 v. 3.3 packages affy Gautier et ah, 2004 v. 1.50.0, drosgenomel.db 


V .3.2.3, drosophila2.db v. 3.2.3, genehlter v. 1.54.2, and limma [Ritchie et ah, 2015 


V. 3.28.20. Background correction was carried out by Agilent Feature Extraction software 
for the real {E. crypticus) dataset, while the Affymetrix MAS5 algorithm, as implemented 
in the limma package, was used for the Golden and Platinum Spike datasets. 


To ensure an optimal comparison between the different normalization methods, only gene 
probes with good signal quality (flag IsPosAndSignif = True) in all samples were employed 
for the analysis of the E. crypticus dataset. This implied the selection of 18,339 gene 
probes from a total of 43,750. For the Golden and Platinum Spike datasets, data were 
considered as missing when probe sets were not called present by the MAS5 algorithm. 


The synthetic dataset without differential gene expression was generated gene by gene 
as normal variates with mean and variance equal, respectively, to the sample mean and 
sample variance of the expression levels for each gene, as detected from the real E. cryp¬ 
ticus dataset after SVGD normalization. The synthetic dataset with differential gene 
expression was generated equally, except for the introduction of differences in expression 
averages between treatments and controls. The magnitude of the difference in averages 
was equal, for each differentially expressed gene (DEG), to twice the sample variance. 
The percentage of DEGs for each treatment was chosen randomly, in logarithmic scale, 
from a range between 0.9% and 90%, while ensuring that 10% of genes were not differ¬ 
entially expressed across the entire dataset. One third of the treatments were mostly 
over-expressed (for each treatment independently, the probability of a DEG being over¬ 
expressed was O ~ 1 — |A/'(0,0.1^)|), one third of the treatments were mostly under¬ 
expressed (O ~ |A/'(0, 0.1^)|), and the remaining third had mostly balanced differential 
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gene expression (O ~ A/’(0.5, 0.1^)). For both synthetic datasets, the applied normal¬ 
ization factors were those detected by SVCD normalization from the real E. crypticus 
dataset. 

Median normalization was performed, for each sample, by subtracting the median of 
the distribution of expression levels, and then adding the overall median to preserve the 
global expression level. Quantile normalization was performed as implemented in the 
limma package. 

The two condition-decomposition normalizations, MedianCD and SVCD, proceeded in 
the same way: hrst, independent within-condition normalization for each experimental 
condition, using all genes. Then, one between-condition normalization, iteratively iden¬ 
tifying no-variation genes and normalizing until convergence of the set of no-variation 
genes. And hnally, another between-condition normalization using only the no-variation 
genes detected, to calculate the between-condition normalization factors. 

The criterion for convergence of MedianCD normalization was to require that the relative 
changes in the standard deviation of the normalization factors were less than 0.1%, or less 
than 10% for 10 steps in a row. In the case of SVCD normalization, convergence required 
that numerical errors were, compared to estimated statistical errors (see below), less than 
1%, or less than 10% for 10 steps in a row. Convergence of the set of no-variation genes was 
achieved by intersection of the sets found during 10 additional steps under convergence 
conditions. These default convergence parameters were used for all the MedianCD and 
SVCD normalizations reported, with the exception of MedianCD with the Golden Spike 
dataset, which used 30% (instead of 10%) of relative change for 10 steps in a row, to reach 
convergence. 

In SVCD normalization, the distribution of standard vectors was trimmed in each step to 
remove the 1% more extreme values of variance. 

Differentially expressed genes were identihed with limma analysis or t-tests, controlling 
the false discovery rate to be below 5%, independently for each comparison of treatment 
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versus control. 


The reference distributions with permutation symmetry shown in the polar plots of Sup¬ 
plementary Movies S1-S3 were calculated through the six possible permutations of the 
empirical standard vectors. The Watson statistic was calculated with the two-sample 
test 


Durbin, 1973 , comparing with an equal number of samples obtained by sampling 


with replacement the permuted standard vectors. 


Condition Decomposition of the Normalization Problem 


In a gene expression dataset with g genes, c experimental conditions and n samples per 
condition, the observed expression levels of gene j in condition k, = {y[f ,. • •, Unj)', 
can be expressed in log 2 -scale as 



(fc) , 

N + 


,{k) 


( 1 ) 


where is the vector of true gene expression levels and is the vector of normalization 
factors. 


Given a sample vector x, the mean vector is x = a:l, and the residual vector is x = x — x. 
Then, Q can be linearly decomposed into 









( 2 ) 

( 3 ) 


Equations ([^ dehne the within-condition normalizations for each condition k. The scalar 
values in ([^ are used to obtain the equations on condition means. 


y* = x; + a*. 


= X^. + a . 


( 4 ) 

( 5 ) 


The between-condition normalization is dehned by (@. Equations 0 reduce to a single 
number, which is irrelevant to the normalization. The complete solution for each condition 
is obtained with a^^) = a*^^^-|-a^*^). 





For full details about this condition-decomposition approach, see Supplementary Mathe¬ 
matical Methods in the Supplementary Information. 


Standard-Vector Normalization 


The n samples of gene j in a given condition can be modeled with the random vectors 
Xj,Yj G M"". Again, Yj = -|- a, where a is a fixed vector of normalization factors. 
It can be proved under fairly general assumptions (see Supplementary Information), that 
the true standard vectors have zero expected value 


X, 


E y/n — 1 = 0 , 


X, 


whereas the observed standard vectors verify, as long as a 7 ^ 0 , 


0 < E ( y/n — 1 
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This motivates the following iterative procedure to solve (|^ and (|^ {standard-vector 
normalization)-. 
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, for t > 0. 
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At convergence, limi^oo = 0 , which implies limi^oo = Xj and = a. 

Convergence is faster the more symmetric the empirical distribution of Xj/||xj|| is on the 
unit {n — 2)-sphere. Convergence is optimal with spherically symmetric distributions, 
such as the Gaussian distribution, because in that case 


E 



Aa, with 0 < A < E 


lY, 


(11) 
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Assuming no dependencies between genes, an approximation of the statistical error at 
step t can be obtained with 


EfllbC'll 






( 12 ) 


dt)| 


.=1 II 

This statistical error was compared with the numerical error to assess convergence. 


See Supplementary Mathematical Methods in the Supplementary Information for full 
details about this algorithm. See also Supplementary Movies S1-S3 for normalization 
examples. 


Identification of No-Variation Genes 

No-variation genes were identihed with one-sided Kolmogorov-Smirnov tests, as goodness- 
of-£t tests against the uniform distribution, carried out on a distribution of p-values. 
These p-values were obtained from ANOVA tests on the expression levels of genes, grouped 
by experimental condition. The KS test was rejected at a = 0.001. 

See Supplementary Mathematical Methods in the Supplementary Information for more 
details about this approach to identify no-variation genes. See also Supplementary Movies 
S4-S6 for examples of use. 
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Figure 1: MedianCD and SVCD normalization resulted in the detection of much larger 
between-condition variation in the datasets with differential gene expression, compared 
to Median and Quantile normalization. Panels show interquartile ranges of expression 
levels for the 153 samples, grouped by the 51 experimental conditions (Ag, blue-yellow; 
Cu, red-cyan; Ni, green-orange; UV, purple; see Supplementary Table SI). Black lines 
indicate medians. Rows and columns correspond to normalization methods and datasets, 
respectively, as labeled. 
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Figure 2: The selection of genes for the final between-condition normalization in Me- 
dianCD and SVCD normalization was crucial to preserve the variation between condi¬ 
tions. Panels show the detected variation as a function of the number of genes used in 
the between-condition normalization, for the real dataset (a), synthetic dataset with dif¬ 
ferential gene expression (b), and synthetic dataset without differential gene expression 
(c). Between-condition variation is represented as the standard deviation of the within- 
condition mean averages (averages of sample means, for all samples of the condition). See 
Supplementary Fig. SI for results using within-condition median averages, with similar 
behavior. Each point in each panel indicates the variation obtained with one complete 
normalization (black circles, MedianCD normalization; blue circles, SVCD normaliza¬ 
tion). Genes were selected in two ways: randomly (empty circles) or in decreasing order 
of p-values from a test for detecting no-variation genes (hlled circles). Big circles show 
the working points corresponding to the results depicted in Fig. Ij-o, which were chosen 
automatically. Black dashed lines show references for decays, with the same values 
in all panels. 
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Figure 3: MedianCD and SVCD normalization allowed to detect much larger numbers 
of differentially expressed genes (DEGs) in the datasets with differential gene expres¬ 
sion. Panels show results for the real dataset (a), synthetic dataset with differential gene 
expression (b), and synthetic dataset without differential gene expression (c). They dis¬ 
play the number of DEGs for each treatment compared to the corresponding control, 
obtained after applying the four normalization methods (empty black circles, Median 
normalization; empty red up triangles. Quantile normalization; hlled green circles, Me- 
dianGD normalization; hlled blue up triangles, SVGD normalization). For the synthetic 
dataset with differential gene expression (b), the numbers of treatment positives are also 
shown, as empty black down triangles. In each panel, treatments are ordered according 
to the number of DEGs identihed with SVGD normalization, increasing from left to right 
(see Supplementary Table 2, for real dataset). Differential gene expression was analyzed 
with R/Bioconductor package limma. Supplementary Fig. S2 shows results obtained with 
f-tests, qualitatively similar but with much lower detection of differential gene expression. 
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Figure 4: For the real dataset, MedianCD and SVCD normalization allowed to de¬ 
tect variation in gene expression of smaller magnitude than with Median and Quantile 
normalization. Boxplots display absolute values of DEG fold changes, for each treatment 
compared to the corresponding control, obtained after Median normalization (a). Quantile 
normalization (b), MedianCD normalization (c), and SVCD normalization (d). Boxplots 
are colored by treatment, with the same color code as in Figs. All panels have the same 
order of treatments as in Fig. |^, i.e. in increasing number of DEGs identified with SVCD 
normalization (Supplementary Table 2). Dashed horizontal lines indicate references of 
1.5-fold and 2-fold changes. 
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Figure 5: For the real dataset, detected differential gene expression was unbalanced, 
specially after using MedianCD and SVCD normalization and for the treatments with 
more DEGs (Fig. |^). Panels show the balance of differential gene expression, for each 
treatment compared to the corresponding control, obtained after Median normalization 
(a). Quantile normalization (b), MedianCD normalization (c), and SVCD normalization 
(d). Each point represents the balance of differential gene expression, i? (5 = 0, same 
number of over- and under-expressed genes; B = -|-1, all DEGs over-expressed; B = —0.5, 
75% DEGs under-expressed). Points are colored by treatment, with the same color code as 
in Figs. Bi All panels have the same order of treatments as in Figs.|^,|^ i.e. in increasing 
number of DEGs identified with SVCD normalization (Supplementary Table 2). Dashed 
horizontal lines indicate references for balanced differential gene expression (5 = 0). 
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Figure 6: In the synthetic dataset with differential gene expression, and with more 
than 10% of treatment positives (Fig. |^), Median and Quantile normalization resulted 
in less statistical power and uncontrolled false discovery rate. The panels display the true 
positive rate (a) and false discovery rate (b), for each treatment compared to the corre¬ 
sponding control, obtained after applying the four normalization methods (same symbols 
as in Fig. empty black circles. Median normalization; empty red up triangles. Quan¬ 
tile normalization; filled green circles, MedianCD normalization; filled blue up triangles, 
SVCD normalization). Both panels have the same order of treatments as in Fig. [^, i.e. 
in increasing number of differentially expressed genes identified with SVCD normaliza¬ 
tion. Differential gene expression was analyzed with R/Bioconductor package limma. The 
dashed horizontal line in (b) indicates the desired bound on the false discovery rate at 
0.05. 
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Figure 7: In the synthetic dataset with differential gene expression (Fig. |^, the un¬ 
balance between over- and under-expressed genes was a key factor in the lowered true 
positive rate and uncontrolled false discovery rate obtained after Median and Quantile 
normalization. Panels show the true positive rate (a) and false discovery rate (b) as a 
function of the balance of differential gene expression, B (5 = 0, same number of over- 
and under-expressed genes; B = -|-1, all DEGs over-expressed; B = —0.5, 75% DEGs 
under-expressed). Each point in both panels represents the results for one treatment 
compared to the corresponding control, obtained after applying the four normalization 
methods (same symbols as in Figs.[^|^ empty black circles. Median normalization; empty 
red up triangles. Quantile normalization; filled green circles, MedianGD normalization; 
filled blue up triangles, SVGD normalization). Differential gene expression was analyzed 
with R/Bioconductor package limma. The dashed horizontal line in (b) indicates the 
desired bound on the false discovery rate at 0.05. 
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Figure 8: In the Golden Spike dataset, the best detection of differential gene expression 
was achieved after using MedianCD and, specially, SVCD normalization. Panels display 
ROC curves, with the true positive rate versus the false positive rate (a), or the number 
of true positives versus the number of false positives (b). Each curve shows the results 
obtained after applying the four normalization methods plus Cyclic Loess normalization 
(same colors and symbols as in Figs. iillzl black curve with empty black circles. Median 
normalization; red curve with empty red up triangles. Quantile normalization; green 
curve with hlled green circles, MedianCD normalization; blue curve with hlled blue up 
triangles, SVCD normalization; magenta curve with hlled magenta diamonds. Cyclic Loess 
normalization). Dashed curves with lightly hlled symbols, overlapping the response of 
SVCD normalization, show results when the list of known negatives was provided to 
MedianCD, SVCD, and Cyclic Loess normalization. The two points per normalization 
method show results when controlling the false discovery rate (FDR) to be below 0.01 
(left point) or 0.05 (right point). Dashed lines in (b) show references for actual FDR 
equal to 0.01, 0.05, 0.1, 0.2, or 0.5 (from left to right). Compared to MedianCD and 
SVCD normalization, the other normalization methods resulted in notably more severe 
degradation of the FDR. 
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Figure 9 
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Figure 9: In the Platinum Spike dataset, all normalization methods resulted in similar 
detection of differential gene expression, with MedianCD and SVCD normalization being 
only marginally better. Panels display ROC curves, with the true positive rate versus the 
false positive rate (a), or the number of true positives versus the number of false positives 
(b). Each curve shows the results obtained after applying the four normalization methods 
plus Cyclic Loess normalization (same colors and symbols as in Figs. [6]-[^ black curve 
with empty black circles. Median normalization; red curve with empty red up triangles. 
Quantile normalization; green curve with hlled green circles, MedianCD normalization; 
blue curve with hlled blue up triangles, SVCD normalization; magenta curve with hlled 
magenta diamonds. Cyclic Loess normalization). Dashed curves with lightly hlled symbols 
show results when the list of known negatives was provided to MedianCD, SVCD, and 
Cyclic Loess normalization. As in Fig. the two points per normalization method show 
results when controlling the false discovery rate (FDR) to be below 0.01 (left point) or 
0.05 (right point). Dashed lines in (b) show references for actual FDR equal to 0.01, 0.05, 
0.1, 0.2, or 0.5 (from left to right). Compared to the Golden Spike dataset (Fig. [^, the 
diherence between normalization methods in the resulting degradation of the FDR was 
smaller for this dataset. 
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Supplementary Tables 


Supplementary Table SI: Experimental conditions of the toxicity experiment on E. 
crypticus, listed in the same order as they appear in each panel of Fig. 1, from left to 
right. 


Condition 

number 

Condition ID 

Condition description 

1 

Ag.AgNO3.EC20.3d 

AgN03 EC20 3 days 

2 

Ag.AgNO3.EC20.7d 

AgN03 EC20 7 days 

3 

Ag.AgNO3.EC50.3d 

AgN03 EC50 3 days 

4 

Ag.AgNO3.EC50.7d 

AgN03 EC50 7 days 

5 

Ag.Coated.EC20.3d 

Ag-NPs PVP-Coated EC20 3 days 

6 

Ag.Coated.EC20.7d 

Ag-NPs PVP-Coated EC20 7 days 

7 

Ag.Coated.EC50.3d 

Ag-NPs PVP-Coated EC50 3 days 

8 

Ag.Coated.EC50.7d 

Ag-NPs PVP-Coated EC50 7 days 

9 

Ag.NC.EC20.3d 

Ag-NPs Non-Coated EC20 3 days 

10 

Ag.NC.EC20.7d 

Ag-NPs Non-Coated EC20 7 days 

11 

Ag.NC.EC50.3d 

Ag-NPs Non-Coated EC50 3 days 

12 

Ag.NC.EC50.7d 

Ag-NPs Non-Coated EC50 7 days 

13 

Ag.NM300K.EC20.3d 

AgNM300K EC20 3 days 

14 

Ag.NM300K.EC20.7d 

AgNM300K EC20 7 days 

15 

Ag.NM300K.EC50.3d 

AgNM300K EC50 3 days 

16 

Ag.NM300K.EC50.7d 

AgNM300K EC50 7 days 

17 

Ag.CT.3d 

Ag Control 3 days 

18 

Ag.CT.7d 

Ag Control 7 days 

19 

Ag.CTD.3d 

Ag Control Dispersant 3 days 

20 

Ag.CTD.7d 

Ag Control Dispersant 7 days 

21 

Cu.CuNO3.EC20.3d 

CuN03 EC20 3 days 

22 

Cu.CuNO3.EC20.7d 

CuN03 EC20 7 days 

23 

Cu.CuNO3.EC50.3d 

CuN03 EC50 3 days 

24 

Cu.CuNO3.EC50.7d 

CuN03 EC50 7 days 

25 

Cu.Cu.NPs.EC20.3d 

Cu-NPs EC20 3days 

26 

Cu.Cu.NPs.EC20.7d 

Cu-NPs EC20 7days 

27 

Cu.Cu.NPs.EC50.3d 

Cu-NPs EC50 3 days 

28 

Cu.Cu.NPs.EC50.7d 

Cu-NPs EC50 7days 

29 

Cu.Cu.Nwires.EC20.3d 

Cu-NWires EC20 3 days 

30 

Cu.Cu.Nwires.EC20.7d 

Cu-NWires EC20 7 days 

31 

Cu.Cu.Nwires.EC50.3d 

Cu-NWires EC50 3 days 

32 

Cu.Cu.Nwires.EC50.7d 

Cu-NWires EC50 7 days 

33 

Cu.Cu.field.EC20.3d 

Cu-Field EC20 3 days 

34 

Cu.Cu.field.EC20.7d 

Cu-Field EC20 7 days 

35 

Cu.Cu.field.EC50.3d 

Cu-Field EC50 3 days 

36 

Cu.Cu.field.EC50.7d 

Cu-Field EC50 7 days 

37 

Cu.CT.3d 

Cu Control 3 days 

38 

Cu.CT.7d 

Cu Control 7 days 

39 

Ni.NiNO3.EC20.3d 

NiN03 EC20 3 days 

40 

Ni.NiNO3.EC20.7d 

NiN03 EC20 7 days 

41 

Ni.NiNO3.EC50.3d 

NiN03 EC50 3 days 

42 

Ni.NiNO3.EC50.7d 

NiN03 EC50 7 days 

43 

Ni.Ni.NPs.EC20.3d 

Ni-NPs EC20 3 days 

44 

Ni.Ni.NPs.EC20.7d 

Ni-NPs EC20 7 days 

45 

Ni.Ni.NPs.EC50.3d 

Ni-NPs EC50 3 days 

46 

Ni.Ni.NPs.EC50.7d 

Ni-NPs EC50 7 days 

47 

Ni.CT.3d 

Ni Control 3 days 

48 

Ni.CT.7d 

Ni Control 7 days 

49 

Uv.UV.D1.5d 

UV Dose 1 

50 

Uv.UV.D2.5d 

UV Dose 2 

51 

Uv.CT.5d 

UV Control 
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Supplementary Table S2: Treatment vs control comparisons, listed in increasing number 


of differentially expressed genes (DEGs), obtained for the real E. crypticus dataset with 
SVCD normalization and limma analysis. This is the same order as in Figs. 3a, 4, 5, from 
left to right. 


Comparison 

number 

Treatment ID 

Control ID 

Treatment description 

Number of 
DEGs 

1 

Ag.NM300K.EC20.7cl 

Ag.CTD.7d 

Ag NM300K EC20 7 days 

2 

2 

Ni.Ni.NPs.EC20.7d 

Ni.CT.7d 

Ni-NPs EC20 7 days 

7 

3 

Cu.CuNO3.EC50.7cl 

Cu.CT.7d 

CuN03 EC50 7 days 

26 

4 

Cu.CuNO3.EC20.3cl 

Cu.CT.3d 

CuN03 EC20 3 days 

27 

5 

Cu.Cu.NPs.EC20.3d 

Cu.CT.3d 

Cu-NPs EC20 3 days 

31 

6 

Ag.AgNO3.EC50.7d 

Ag.CT.7d 

AgN03 EC50 7 days 

33 

7 

Cu.Cu.NPs.EC20.7d 

Cu.CT.7d 

Cu-NPs EC20 7 days 

33 

8 

Ag.NM300K.EC20.3d 

Ag.CTD.3d 

Ag NM300K EC20 3 days 

38 

9 

Ag.NM300K.EC50.3d 

Ag.CTD.3d 

Ag NM300K EC50 3 days 

52 

10 

Ni.NiNO3.EC20.3d 

Ni.CT.3d 

NiN03 EC20 3 days 

74 

11 

Ni.NiNO3.EC20.7d 

Ni.CT.7d 

NiN03 EC20 7 days 

79 

12 

Ag.NC.EC20.7d 

Ag.CT.7d 

Ag-NPs Non-Coated EC20 7 days 

106 

13 

Ni.Ni.NPs.EC50.7d 

Ni.CT.7d 

Ni-NPs EC50 7 days 

107 

14 

Ag.AgNO3.EC20.7d 

Ag.CT.7d 

AgN03 EC20 7 days 

113 

15 

Ag.NC.EC50.7d 

Ag.CT.7d 

Ag-NPs Non-Coated EC50 7 days 

163 

16 

Ag.NC.EC20.3d 

Ag.CT.3d 

Ag-NPs Non-Coated EC20 3 days 

240 

17 

Ag.AgNO3.EC50.3d 

Ag.CT.3d 

AgN03 EC50 3 days 

260 

18 

Ag.Coated. EC20.7d 

Ag.CT.7d 

Ag-NPs PVP-Coated EC20 7 days 

261 

19 

Ni.NiNO3.EC50.7d 

Ni.CT.7d 

NiN03 EC50 7 days 

329 

20 

Cu.Cu.NPs.EC50.7d 

Cu.CT.7d 

Cu-NPs EC50 7 days 

343 

21 

Ag.Coated. EC50.7d 

Ag.CT.7d 

Ag-NPs PVP-Coated EC50 7 days 

346 

22 

Cu.Cu.Nwires.EC50.7d 

Cu.CT.7d 

Cu-NWires EC50 7 days 

383 

23 

Cu.CuNO3.EC20.7d 

Cu.CT.7d 

CuN03 EC20 7 days 

393 

24 

Cu.Cu.Nwires.EC20.7d 

Cu.CT.7d 

Cu-NWires EC20 7 days 

479 

25 

Cu.CuNO3.EC50.3d 

Cu.CT.3d 

CuN03 EC50 3 days 

522 

26 

Ag.AgNO3.EC20.3d 

Ag.CT.3d 

AgN03 EC20 3 days 

908 

27 

Ag.Coated. EC20.3d 

Ag.CT.3d 

Ag-NPs PVP-Coated EC20 3 days 

937 

28 

Ag.NM300K.EC50.7d 

Ag.CTD.7d 

Ag NM300K EC50 7 days 

1,264 

29 

Ni.Ni.NPs.EC20.3d 

Ni.CT.3d 

Ni-NPs EC20 3 days 

1,464 

30 

Cu.Cu.field.EC20.7d 

Cu.CT.7d 

Cu-Field EC20 7 days 

1,627 

31 

Ni.NiNO3.EC50.3d 

Ni.CT.3d 

NiN03 EC50 3 days 

1,647 

32 

Uv.UV.D2.5d 

Uv.CT.5d 

UV Dose 2 

1,864 

33 

Ni.Ni.NPs.EC50.3d 

Ni.CT.3d 

Ni-NPs EC50 3 days 

2,334 

34 

Cu.Cu.field.EC50.3d 

Cu.CT.3d 

Cu-Field EC50 3 days 

3,570 

35 

Cu.Cu.field.EC50.7d 

Cu.CT.7d 

Cu-Field EC50 7 days 

4,396 

36 

Uv.UV.D1.5d 

Uv.CT.5d 

UV Dose 1 

4,745 

37 

Cu.Cu.NPs.EC50.3d 

Cu.CT.3d 

Cu-NPs EC50 3 days 

5,988 

38 

Cu.Cu.field.EC20.3d 

Cu.CT.3d 

Cu-Field EC20 3 days 

9,225 

39 

Ag.Coated. EC50.3d 

Ag.CT.3d 

Ag-NPs PVP-Coated EC50 3 days 

9,478 

40 

Cu.Cu.Nwires.EC20.3d 

Cu.CT.3d 

Cu-NWires EC20 3 days 

9,751 

41 

Ag.NC.EC50.3d 

Ag.CT.3d 

Ag-NPs Non-Coated EC50 3 days 

9,884 

42 

Cu.Cu.Nwires.EC50.3d 

Cu.CT.3d 

Cu-NWires EC50 3 days 

10,285 
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Supplementary Figures 
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Supplementary Figure SI 
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Supplementary Figure SI: Representing between-condition variation as the standard 
deviation of the within-condition median averages (averages of sample medians, for all 
samples of the condition) produced similar results to those obtained with within-condition 
mean averages (Fig. 2). Panels show detected variation as a function of the number of 
genes used in the between-condition normalization, for the real dataset (a), synthetic 
dataset with differential gene expression (b), and synthetic dataset without differential 
gene expression (c). Labeling is the same as in Fig. 2. Each point in each panel indicates 
the variation obtained with one complete normalization (black circles, MedianCD normal¬ 
ization; blue circles, SVCD normalization). Gene were selected in two ways: randomly 
(empty circles) or in decreasing order of p-values from a test for detecting no-variation 
genes (hlled circles). Big circles show the working points corresponding to the results 
depicted in Fig. Ij-o. Black dashed lines show references for decays, with the same 
values in all panels. 
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Supplementary Figure S2 
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Supplementary Figure S2: With t-tests instead of limma analysis (Fig. 3a), MedianCD 
and SVCD normalization also allowed to detect larger numbers of differentially expressed 
genes (DEGs), compared to Median and Quantile normalization. Panels show the number 
of DEGs obtained for the real dataset (a), synthetic dataset with differential gene expres¬ 
sion (b), and synthetic dataset without differential gene expression (c). Symbols are the 
same as in Fig. 3 (empty black circles, Median normalization; empty red up triangles. 
Quantile normalization; hlled green circles, MedianGD normalization; filled blue up tri¬ 
angles, SVGD normalization; empty black down triangles, number of treatment positives 
(b)). In each panel, treatments are ordered according to the number of DEGs identified 
with SVGD normalization, increasing from left to right. 
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Supplementary Figure S3: In the Golden Spike dataset, and without restricting probe 
sets to those with signal in all samples (Fig. 8), MedianCD and SVCD normalization also 
allowed the best detection of differential gene expression. Both panels display ROC curves, 
with the true positive rate versus the false positive rate (a), or the number of true positives 
versus the number of false positives (b). Each curve shows the results obtained after 
applying the four normalization methods plus Cyclic Loess normalization (same colors and 
symbols as in Fig. 8; black curve with empty black circles. Median normalization; red curve 
with empty red up triangles. Quantile normalization; green curve with hlled green circles, 
MedianCD normalization; blue curve with hlled blue up triangles, SVCD normalization; 
magenta curve with hlled magenta diamonds. Cyclic Loess normalization). Dashed curves 
with lightly hlled symbols, overlapping the response of SVCD normalization, show results 
when the list of known negatives was provided to MedianCD, SVCD, and Cyclic Loess 
normalization. The two points per normalization method show results when controlling 
the false discovery rate (FDR) to be below 0.01 (left point) or 0.05 (right point). Dashed 
lines in (b) show references for actual FDR equal to 0.01, 0.05, 0.1, 0.2, or 0.5 (from 
left to right). As in Fig. 8, compared to MedianCD and SVCD normalization, the other 
normalization methods resulted in notably more severe degradation of the FDR. 


66 




0 200 400 600 800 

# False positives 


Supplementary Figure S4 


67 









Supplementary Figure S4: In the Golden Spike dataset, and with t-tests instead of limma 
analysis (Fig. 8), MedianCD and SVCD normalization also allowed the best detection of 
differential gene expression. Both panels display ROC curves, with the true positive 
rate versus the false positive rate (a), or the number of true positives versus the number 
of false positives (b). Each curve shows the results obtained after applying the four 
normalization methods plus Cyclic Loess normalization (same colors and symbols as in 
Figs. 8, S3; black curve with empty black circles. Median normalization; red curve with 
empty red up triangles. Quantile normalization; green curve with hlled green circles, 
MedianCD normalization; blue curve with hlled blue up triangles, SVCD normalization; 
magenta curve with hlled magenta diamonds. Cyclic Loess normalization). Dashed curves 
with lightly hlled symbols, overlapping the response of SVCD normalization, show results 
when the list of known negatives was provided to MedianCD, SVCD, and Cyclic Loess 
normalization. The two points per normalization method show results when controlling 
the false discovery rate (FDR) to be below 0.01 (left point) or 0.05 (right point). Dashed 
lines in (b) show references for actual FDR equal to 0.01, 0.05, 0.1, 0.2, or 0.5 (from left to 
right). Compared to results obtained with limma analysis (Figs. 8, S3), the degradation 
of FDR was slightly less severe with t-tests. MedianCD and SVCD normalization resulted 
again in the least degradation of the FDR. 
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Supplementary Figure S5 
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Supplementary Figure S5: In the Platinum Spike dataset, and without restricting probe 
sets to those with signal in all samples (Fig. 9), all normalization methods resulted in 
similar detection of differential gene expression, with the exception of Cyclic Loess nor¬ 
malization (magenta curve/symbols), whose number of detected positives was slightly 
smaller. Both panels display ROC curves, with the true positive rate versus the false 
positive rate (a), or the number of true positives versus the number of false positives 
(b). Each curve shows the results obtained after applying the four normalization meth¬ 
ods plus Cyclic Loess normalization (same colors and symbols as in Fig. 9; black curve 
with empty black circles. Median normalization; red curve with empty red up triangles. 
Quantile normalization; green curve with hlled green circles, MedianCD normalization; 
blue curve with hlled blue up triangles, SVCD normalization; magenta curve with hlled 
magenta diamonds. Cyclic Loess normalization). Dashed curves with lightly hlled sym¬ 
bols show results when the list of known negatives was provided to MedianCD, SVCD, 
and Cyclic Loess normalization. The two points per normalization method show results 
when controlling the false discovery rate (FDR) to be below 0.01 (left point) or 0.05 (right 
point). Dashed lines in (b) show references for actual FDR equal to 0.01, 0.05, 0.1, 0.2, 
or 0.5 (from left to right). As in Fig. 9, the diherence between normalization methods 
in the resulting degradation of the FDR was smaller for this dataset than for the Golden 
Spike dataset (Figs. 8, S3, S4). 
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Supplementary Figure S6 
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Supplementary Figure S6: In the Platinum Spike dataset, and with t-tests instead 
of limma analysis (Fig. 9), all normalization methods resulted in similar detection of 
differential gene expression, with MedianCD and SVCD normalization being marginally 
better. Both panels display ROC curves, with the true positive rate versus the false 
positive rate (a), or the number of true positives versus the number of false positives 
(b). Each curve shows the results obtained after applying the four normalization methods 
plus Cyclic Loess normalization (same colors and symbols as in Figs. 9, S5; black curve 
with empty black circles. Median normalization; red curve with empty red up triangles. 
Quantile normalization; green curve with hlled green circles, MedianCD normalization; 
blue curve with hlled blue up triangles, SVCD normalization; magenta curve with hlled 
magenta diamonds. Cyclic Loess normalization). Dashed curves with lightly hlled symbols 
show results when the list of known negatives was provided to MedianCD, SVCD, and 
Cyclic Loess normalization. The two points per normalization method show results when 
controlling the false discovery rate (FDR) to be below 0.01 (left point) or 0.05 (right 
point). Dashed lines in (b) show references for actual FDR equal to 0.01, 0.05, 0.1, 0.2, 
or 0.5 (from left to right). Compared to results obtained with limma analysis (Figs. 9, 
S5), and in contrast to the Golden Spike dataset (Figs. 8, S3, S4), the degradation of the 
FDR was slightly more severe with t-tests in this dataset. 
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Legends of Supplementary Movies 


Supplementary Movie SI. Example of one within-condition Standard-Vector normal¬ 
ization, for the real {E. crypticus) dataset. The movie shows the 14 steps of the conver¬ 
gence of Standard-Vector normalization performed for the condition Ag.NM300K.EC20.3d 
(exposure to Ag NM300K nanoparticles, with an EC 50 dose for three days). Left panels 
show a subset of 10,000 randomly-chosen sample standard vectors, with one gray line 
per gene, in the plane of residual vectors, i.e. the plane perpendicular to the vector of 
coordinates (1,1,1). The lines labeled sl-s3 indicate the projection of the axes onto this 
plane, the number 1-3 being the sample number. The red line is the estimated vector 
of normalization factors at each step, with length ||offset||, which results from the bias 
of the standard vectors towards that direction. Right panels show the polar distribution 
of vector angles (black solid curve), compared to the distribution of vector angles after 
all six possible permutations of the sample labels (blue dashed curve). The Watson 
statistic provides a measure of the difference between both distributions. In the initial 
step, there is a large bias towards the hrst and second sample, compared to the third one. 
The bias is reduced in each step by subtracting the normalization factor estimate, which 
makes the distribution of standard vectors more permutationally symmetric and with a 
correspondingly smaller U'^. After convergence in 14 steps, there is no detectable bias 
left. 

Supplementary Movie S2. Example of one within-condition Standard-Vector normal¬ 
ization, for the synthetic dataset without differential gene expression. The movie displays 
the Standard-Vector normalization performed for the condition Ag.NM300K.EC20.3d, 
on the synthetic dataset generated with the standard normal A/'(0,1) as base distribu¬ 
tion. Format and labels are the same as in Supplementary Movie SI. Note the uniform 
distribution of angles after normalizing, which corresponds to a parametric family of prob¬ 
ability distributions with spherical symmetry. The corresponding movie for the synthetic 
dataset with differential gene expression is virtually identical, given that standard vectors 
are independent of sample averages. 
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Supplementary Movie S3. Example of one within-condition Standard-Vector normal¬ 
ization, for synthetic log-normal data. The movie shows the standard-vector normalization 
performed for the condition Ag.NM300K.EC20.3d, on a synthetic dataset generated in 
the same way as that of Supplementary Movie S2, except for using as base distribution the 
log-normal logA/'(0,0.5^), which has large positive skewness (~ 1.75). Format and labels 
are the same as in Supplementary Movies SI, S2. Note that the distribution of standard 
vector angles after normalizing is not uniform, but it has permutation symmetry. 

Supplementary Movie S4. Identihcation of no-variation genes (non-differentially ex¬ 
pressed genes) for the real {E. crypticus) dataset. The movie shows the 27 steps of the 
corresponding between-condition normalization, with SVCD normalization. Both panels 
show the empirical distribution function of p-values obtained from ANOVA tests on ex¬ 
pression levels, per gene and grouped by experimental condition. The left panel shows the 
complete interval [0,1], while the right panel depicts the interval close to 1 where the hrst 
goodness-of-fit (GoF) test was not rejected. The black portion of the distribution corre¬ 
sponds to p-values at which the GoF test was rejected, the big black circle indicates the 
hrst p-value at which the GoF test was not rejected, and the red portion shows the range 
of p-values whose genes, as a result, were identihed as no-variation genes. The dashed blue 
line and the dotted blue line indicate, respectively, the theoretical distribution function 
of the uniform distribution and the threshold of the one-sided Kolmogorov-Smirnov test 
(a = 0.001, n equal to the number of p-values for the hrst GoF test that was not rejected). 
Gonvergence criteria was met from steps 18 to 27. These last ten steps ensured stability 
of the detected set of no-variation genes, by cumulative intersection of the successive sets 
identihed, each one with no-variation genes, as shown. The resulting hnal set had 
974 no-variation genes. 

Supplementary Movie S5. Identihcation of no-variation genes for the synthetic dataset 
with diherential gene expression. The movie shows the 15 steps of the corresponding 
between-condition normalization, with SVGD normalization. Format and labels are the 
same as in Supplementary Movie S4. Note the similarity with the behavior observed for 
the real dataset (Supplementary Movie S4). 
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Supplementary Movie S6. Identification of no-variation genes for the synthetic dataset 
without differential gene expression. The movie shows the 14 steps of the corresponding 
between-condition normalization, with SVCD normalization. Format and labels are the 
same as in Supplementary Movies S4, S5. Note that the distribution of p-values at 
convergence (steps 5-14) is uniform in the whole interval [0,1], up to the level detected 
by the goodness-of-£t test. This corresponds to a dataset with no differentially expressed 
genes. 
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Supplementary Mathematical Methods 


Contents 

51 Vectorial representation of sample data. [75] 

52 Linear decomposition of the normalization problem. [78] 

53 Permutation invariance of multivariate data. [83] 

54 Standard-vector normalization. [89] 

55 Identification of no-variation genes . [H] 

SI Vectorial representation of sample data 

Let xi^... ,Xn be the samples of n independent and identically distributed random vari¬ 
ables Xi,... ,X„. Let us represent the samples Xi,... with the M"' column vector 
X = (xi,..., x„)', and let us denote the sample mean by x = ^i/n. 

Let us dehne the M"’ —)■ vectorial operators mean (~) and residual (~), respectively, as 

X = (x, ... ,x)' = xl, (13) 

X = X —X = X — xl, (14) 

1 being the all-ones column vector of dimension n. 

Thus, any sample vector x G can be decomposed as 

X = x-l-x. (15) 

The mean vector x contains the sample mean, while the residual vector x carries the 
sample variation around the mean. 
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The vectorial operators mean (13) and residual (14) are linear. 


Proposition. For any two sample vectors x, y G M"" and any two numbers a, /? G M, 


ax + /3y = 

ax + /3y, 

(16) 

ax + /dy = 

ax + fy. 

(17) 


Proof. Let us denote x = {xi ,..., Xn)' and y = (|/i,..., ?/„)'. 


n 


n 


n 


= ax + /3y, 


ax +/dy = ax +/3y — ax +/dy = ax +/dy — (ax +/dy), 
= a(x-x)+/?(y-y) = ax +/?y. □ 


An essential property of the mean and residual vectors is that they belong to subspaces 
that are orthogonal complements [Eaton 2007 . Hence, for any sample vector x G M", 
the mean vector x belongs to the subspace of dimension 1 spanned by the unit vector 


1 = 1/^/n, while the residual vector x belongs to the (n — l)-dimensional hyperplane 
orthogonal to 1. 


The lengths of the mean vector and residual vector are equal, up to a scaling factor, to the 
sample mean and sample standard deviation, respectively. For a set of samples xi,..., x„, 
where n > 2, let us denote the sample mean as before by x = the sample 

variance as ~ x)^/(u — 1). Then, the lengths of the mean and residual 

vectors obtained from the sample vector x = (xi,..., Xn)' are 


= Vnx^ = vu|x| 


7.2 — 


X = 


. ^ Vi -1 

N 


(18) 

(19) 


Finally, let us dehne the standard vector of the sample vector x = (xi,... ,Xn)' (n > 2), 
as 

stdvec(x) = A/n^^TT—, (20) 
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whenever x 7^ 0, or otherwise as stdvec(x) = 0. 0 is the all-zeros column vector of 
dimension n. 

For a given number of samples n, all the non-zero standard vectors belong to the (n — 2)- 
sphere of radius y/n — 1, embedded in the (n — l)-dimensional hyperplane perpendicular 
to 1. Besides, all the components of a standard vector are equal to the corresponding 
standardized samples, 

i-Y> . i-Y> , _ ry* 

( 21 ) 

l|x|| 

For the degenerate case of having only two samples (n = 2), the only possible values of a 
non-zero standard vector are ±( 1/^2, —l/\/2 )'. 


S2 Linear decomposition of the normalization problem 


Let us consider a gene expression dataset, with g genes and c experimental conditions. 
Each condition k has Sk samples. The total number of samples is s = Ylk=i 

Let us denote the observed expression level of gene j in the sample i of condition k by yb . 
We assume that the observed level is equal, in the usual log 2 -scale, to the addition of 
the normalization factor of'* to the true gene expression level 



(k) (k) 

Aj + “i ■ 


( 22 ) 


ik) 

Solving the normalization problem amounts to finding the normalization factors a) ^ from 

the observed values The normalization factors can be understood as sample-wide 

(^) 

changes in the concentration of mRNA molecules by multiplicative factors equal to 2“^ . 
These changes are caused by technical reasons in the assay and are independent of the 
biological variation in the true levels x[j\ 

Let us represent the true and observed expression levels, x'lj'^ and y^j\ of gene j in the 
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samples i = 1 ... of condition k, by the Sfc-dimensional vectors 

„(fc) ^ y 

7 V "''1.7 ’ • • • ’ "''Sfci 1 5 


(fc) 

y = 


(k) 

Vlj ) • • • ) l/sfej 


(fc) V 


(23) 

(24) 


Let us also represent the unknown normalization factors of condition k by the s^-dimensional 
vector 

a« = (25) 


From (22)-(25), the normalization problem can be written in vectorial form as 


yf = xf> + a(‘>. 


(26) 


Applying the vectorial operators mean (13) and residual (14), we obtain 


-{k) 

y) = 

~{k) 

y = 


xf+ sw. 


(27) 

(28) 


The residual-vector equations (28) correspond to the c within-condition normalizations. 


Each within-condition normalization uses the equations (28) particular to a condition k, 


for the subset of genes Qk ^ {!) • • • > S'} that have expression level available and of enough 
quality in that experimental condition. 


Let us denote the condition means for each gene as 


so that 


A!^) _ 

Z^i=i -^ij 

'j 

^k 


Ak) _ 

(k) 

Z^i=i yij 

'j 



jW - 

„{k) 

Z^i=l “i 


'^k 


_(fc) 

= 1 

Sk 5 

_(fc) 

y} 

-{k)t 

= y) A 

Sfe ’ 


= 

Sk ’ 


(29) 

(30) 

(31) 


(32) 

(33) 

(34) 
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Isj. being the all-ones column vector of dimension Sk- 


Then, the mean-vector equations (27) can be written as 




so they reduce to the scalar equations 


—(k) 

V) 


sf>+s(*>. 


(35) 


(36) 


Let us dehne the vectors of conditions means as 




y.' = 

a* = 






.yfy. 


(37) 

(38) 

(39) 


and let us express the condition-mean equations in vectorial form as 


y, = X,. + a . 


(40) 


Applying again the mean and variance operators, we obtain 


y; = x; + r, 

y* = + 


(41) 

(42) 


The residual-vector equations on condition means (42) correspond to the single between 


condition normalization, in a similar way as (28) do for the each of the within-condition 


normalizations. There is one equation (42) per gene. The only equations used in the 


between-condition normalization are those of the subset of genes Q* C { 1 ,... , 51 } that 
show no evidence of variation across experimental conditions, according to a statistical 
test. 


Given that a* = a*lc, (41) has the only unknown a*. The meaning of a* is a conversion 


factor between the scale the true and observed expression levels. This factor depends on 
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the technology used to measure the expression levels and hnding it is out of the scope of 
the normalization problem. Therefore, without loss of generality, we assume a* = 0, so 


r = Oe, 


a = a . 


( 43 ) 

( 44 ) 


The solution of the between-condition normalization, a*, allows to hnd the mean vectors 
of the normalization factors via (34), (39) and (44). The within-condition normaliza¬ 


tions yield the residual vectors a^^^. The complete solution to the normalization problem 
is hnally obtained, for each condition k, with 


,W ^ gW+SW. 


(45) 


Thus, the original normalization problem (26) has been divided in c-f-l normalization sub 


problems on residual vectors, stated by (28) and (42). In fact, this linear decomposition 


is possible for any partition of the set of s samples. The choice of the partition as the one 
dehned by the experimental conditions is motivated by the need to control the biological 
variation among the genes used in each normalization. All the c-f-l normalizations face 
the same kind of normalization of residuals problem, which we dehne in general as follows. 

Normalization of Residuals Problem. Let yij be the Tth observed value of feature 
j, in a dataset with n >2 observations for each of the m features. The observed values 
Hij are equal to the true values Xij plus the normalization factors a*, which are constant 
across features. In vectorial form, there are m equations 


Yj = Xj+a, 


where the vectors belong to M". As a consequence 


( 46 ) 


Vi = Xj + a. 


( 47 ) 


Solving the normalization of residuals problem amounts to hnding the residual vector of 
normalization factors a from the observed residual vectors jj. In the within-condition 
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normalizations, the features are gene expression levels, with one observation per sample 
of the corresponding experimental condition. In the between-condition normalization, the 
features are means of gene expression levels, with one observation per condition. 


There is, however, an additional requirement imposed by the methods with which we 
propose to solve the between-condition normalization. We would like to consider the 
condition means in (36) as sample data across conditions. This only holds when all 


the conditions have the same number of samples. Otherwise, we balance the condition 
means so that they result from the same number of samples in all conditions, according 
to the procedure described in the following. 

Let s* be the minimum number of samples across conditions, s* = min{si,... ,Sc}. Let 
be independent random samples (without replacement) of size s* from the set of 
indexes {!,..., s^}, with one sample per gene j and condition k. Then, the balanced 
condition means are dehned as 

Z 4‘’ 


— (k)* 


_ (k)^ 

y) 


i&S, 


(fe) 




i&S, 


(k) 


Y 


Ak) 


-{k)* 

= — 


(fe) 


(48) 


(49) 


(50) 


From (22), the balanced condition means verify a relationship similar to (36), 

_ (k)* , _(/c)* 

y) = ■ 


(51) 


Moreover, the average of across the sampling subsets is equal to the unknown 
This implies that (51) are, on average, equivalent to (36). Hence, we use the following 


vectors of balanced conditions means 


= 


TfA) 

..,Xj 

y.* = 

(nr- 

-(c) = 




(52) 

(53) 
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instead of (37), (38), in order to bnild the condition-mean eqnations (40). This balancing 


of the condition means is only reqnired when the experimental conditions have different 
nnmber of samples. 


S3 Permutation invariance of multivariate data 


Let Xij and i/ij be, respectively, the trne and observed valnes of a dataset with n observa¬ 
tions of m featnres, as dehned in the normalization of residuals problem above. 


We have assnmed that the n trne valnes xij ,..., Xnj of featnre j are samples of independent 


and identically distribnted random variables Xij,... ,Xnj- These random variables can 
be represented with the random vector Xj = {Xij,... ^Xnj)', carried by the probability 


space (f2, P) and with indnced space 


). Let ns dehne the random vectors X, 


and X,' with the vectorial operators mean (13) and residnal (14), respectively. 


li -y 

y = Y.— 


i=l 


n 


X, = (A',, ,,, ,X,)' = A,l, 


y = X, - y = X, - x,i. 


(54) 

(55) 

(56) 


Xj = Xj -|- Xj holds for any random vector X^, as well as the other properties presented 
above. Let ns assnme that E( ||Xj|| ) < oo and that P( ||Xj|| = 0) = 0, which imply that 
Xj/||Xj|| has length 1 almost snrely. 

The standard random vector \/n — 1 Xj/||Xj|| is a pivotal qnantity, where the location 
(mean) and scale (standard deviation) of featnre j have been removed. The probability 
distribntion of Xj/||Xj|| across the remaining degrees of freedom over the nnit {n — 
2 )-sphere is governed by the parametric family of the random variables Xij,... ,Xnj- 
Moreover, the independence and identity of distribntion across the n observations implies 
that the distribntion of X^ is exchangeable, i.e. invariant with respect to permntations 
of the observation labels. As a resnlt, Xj/||Xj|| is also permntation invariant, which 
geometrically corresponds to symmetries with respect to the n\ permntations of the axes 
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in the n-dimensional space of random vectors, projected onto the (n — 1)-dimensional 
hyperplane of residual vectors. 


Residual vectors and standard vectors have been widely studied, especially in relation 


to elliptically symmetric distributions and linear models Fang et ah, 1990 Gupta et al, 


2013 , and to the invariances of probability distributions Kallenberg, 2005 . Here, we 


consider these vectors from the viewpoint of the problem of normalizing multivariate 
data, and its relationship with permutation invariance. 


It is well know that, for a multivariate distribution with independent and identically 


distributed components, the expected value of the standard vector is zero Eaton, 2007 


given that it is so for each component. We prove this here for completeness, and to show 
that it is also a necessary consequence of the permutation invariance of the distribution. 


Proposition. The expected value of any true (i.e. without normalization issues) stan¬ 
dard vector is zero. If the n > 2 samples of feature j are independent and identically 
distributed, then 


E 




0 . 


(57) 


Proof. Let Vn be the set of all the permutation matrices in Then, for any P G Vn, 

X,/||X,|| is equal in distribution to PXj/||Xj||. This implies that 


E 





The only vectors that are invariant with respect to all possible permutations are those 
that have all components identical. Therefore, E(Xj/||Xj|| ) = al, with a G M. However, 
X;.l = 0, so that a = E(Xj/||Xj||)'! = 0. Hence E(X^/||Xj||) = 0 . □ 

For each true random vector X^, there is an observed random vector Yj = Xj -|- A, 
where A is the random vector of normalization factors. The random vectors Xj and A 
are independent, representing biological and technical variation, respectively. Therefore, 
and without loss of generality, we assume in what follows a hxed vector of normalization 
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factors a, i.e. we condition on the event {A = a }. We also assume that P( || Yj || = 0 ) = 0, 
which implies that Yj/||Yj|| has length 1 almost surely. 

In contrast to the true standard vector v^n^^Xj/||Xj||, the observed standard vector 
v/^^Y,/||Y,|| is biased toward the direction of a, with the result that the expected 
value is not zero. 


Proposition. If the n>2 samples of feature j are independent and identically distributed, 
whenever a 7 ^ 0, 

E I ^ 0 . (58) 

V iiY.ii; 

When n = 2, there is the additional requirement that P( ||Xj|| < ||a|| ) > 0. This threshold 
of detection only occurs for the degenerate case of n = 2 . 


Proof. Let us consider the projection of Yj/HYjH on a, compared to the projection of 

X7IIX,I|. 


When the vectors X, and a are collinear, 


X'5 Y'5 

= ± 1 , and ^ _ = ± 1 , 


|X,|| ||S 


|Y,|| ||S| 


with 


Y'5 X'S 

J _ ^ _ J 


|Y,|| ||5| 


|X,|| ||5| 


This is the only case when n = 2. The additional requirement ensures that, for n = 2, 


Y'5 X'5 , 

^ - 1 > 0, 


|Y,|| ||5| 


|X,|| ||5|| 


which implies 


E 


Y'S 


|Y,|| ||5|| 


> E 


X'S 


|X,|| ||5| 


Otherwise, when n > 2 and the vectors X^ and a are not collinear, they lie on a plane. 
The vector Yj = Xj + a is the diagonal of the parallelogram dehned by X^ and a. Hence 
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the angle between Yj and a is strictly less than the angle between Xj and a, so the cosine 
of the angle is strictly greater. Thus, 

Y'5 X'S 

_£_ ^ _ 

l|Y,lll|S|| l|X,l|l|Sir 


Due to the permutation symmetries in the distribution of Xj/||Xj||, when n > 2 the 
vector Xj has non-zero probability of being not collinear with a, i.e. P( |X'-a| < 1 ) > 0. 


Therefore, 


which again implies 




> 0 , 



Finally, 


E 




0 . □ 


As a consequence, the normalization of residuals problem may be restated as the problem 
of Ending the normalization factors a from the observed vectors yj, such that the standard 
vectors \/n — 1 (y^ — a)/||yj — a|| are invariant against permutations of the observation 
labels. Or equivalently, such that the standard vectors \/n — 1 (jj — a)/||yj — a|| have 
zero mean. The following property provides an approach to the solution. 


Proposition. Whenever a 7 ^ 0, the component of the expected value of Yj/||Yj|| parallel 
to a verifies 


0 < E ' 


|Y,I 


< E 


lY. 


(59) 


As in (58), when n = 2 we also assume that P( ||Xj|| < ||a|| ) > 0. 


Proof. The first inequality holds from the previous proof. Concerning the second inequal¬ 
ity, let us consider 





























We need to prove that the hrst term on the RHS has negative expected value. Let us 
decompose this term into the positive and negative parts, 

jx,l| X'S ^ / JXjll 35 y _ I jx,|| 'X;g y 

||x, + sii !|x,||||S!i hix,+s|| ||x,ii||S|d d|x,+sii ||x,||iisii/ ’ 

where X~^ = max(X, 0) and X~ = — min(X, 0). 

Because HXj + ajp = \\Xjf + ||a||2 + 2Xta, 



These inequalities are identities when X'- a is of opposite sign to (■ )^, or when X'- a = 0. 
Because of the permutation symmetries of Xj/||Xj||, it follows that P( X'- a 7 ^ 0) > 0, 
which implies 



For any permutation matrix P G Vn, 



, surely, 

dllFX^IP + IISIP 

X ■ 

P —:=J— in distribution, 

l|X,|| 
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so that 


|X,| 


X,- 

= p 


|X,| 


|X,||2 + ||S' 
which together with 


12 IIX. 


|XJ|2 + ||5| 


X 

in distribution, 

iXdl 


IX, 


X, 


|Xd|2 + ||S| 


IX, 


1 = 0 surely, 


implies, as in (57), that 


E 


Kill Xj , _ Q 


|X,|P+ llaip IKi 


Therefore, 


+ 


E 


IX, 


X'5 


|Xd|2 + ||5| 


IX, 


= E 


IX, 


X'S 


|X,||2 + ||5||2 llXjll I|a|| 


Back to the initial expected values, it follows that 


e (( j^iii 3" y\, j ( j^^ii 3^ ) 

VVI|K,+S|II|X, inlaid J VVI|Xi+S||||X,||||a||/ 


which implies 


E 


jx,l| 3 a \ 

X)j -)- a 1 1 II X)j' II 1 1 a 1 1 J 


< 0. □ 


The Gaussian multivariate distribution, among others, has spherical symmetry besides 
permutation symmetry. For parametric families with spherical symmetry, the true stan¬ 
dard vector -\/n^^Xj/||Xj|| has uniform distribution over the (n — 2 )-sphere. As a result, 
the components of Yj/||Yj|| perpendicular to a are antisymmetric with respect to the di¬ 
rection of a, so that they cancel out in expectation. That is, for parametric families with 
spherical symmetry, and as long as a 7 ^ 0, 

^ = Aa, with 0 < A < E 
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E 


(60) 

























S4 Standard-vector normalization 


The properties (59), (60) suggest the use of 


m 



( 61 ) 


to approximate the unknown residual vector of normalization factors a. The following 
iterative method implements this approach to solve the normalization of residuals problem. 


Let us dehne the following recursive sequence, where each step t comprises m vectors 
{j G {1 ,..., m}) and one vector 


= for t > 1, 


-(0) 

= yii 



y.)' 

- yj 


m 

E 

gw 

i=l 

m 


E 

Z=1 






for f > 0. 


d*)i 


( 62 ) 

( 63 ) 


( 64 ) 


We assume that ^ 0„, for all j G {1,... ,m} and all f > 0. Nonetheless, an imple¬ 
mentation of this algorithm benehts from trimming out a small fraction (e.g. 1%) of the 


features with lesser ||yj*^|| in (64), in order to avoid numerical singularities. 











Let us write as a function of the unknowns Xj and a. For any f > 1, 


-it) 


Note that (70) is also valid for t = 0. 


so that, by (70), for f > 0, 


= 

(65) 

= yh-2) _ b(*-2) _ 

(66) 

(67) 

= yr-5:E”. 

t-1 

(68) 

= yj- EE”. 

r=0 

t-1 

(69) 

= 5^bC). 

r=0 

= 0. 

(70) 

', for f > 0, which describe the vector of normalization 

) t, 

t-1 

aW = S - ^bC), 

r=0 

(71) 

yf = % + a“>. 

(72) 


Therefore, the recursive sequence (62)-(64) faces a new, weaker normalization of residuals 
problem at each step t, with true residual vectors x^, observed residual vectors yj*^ and 
unknown normalization factors The step t results in the estimation of normalization 
factors which are removed from yf\ generating the next step. At the beginning, 
= yj and a^^^ = a. 


At convergence, hmt_,.oo = 0. Equations (57), (58), (64) imply that, in such a case, 
hmi_,.oo yf^ = Xj and Convergence is optimal when the parametric family 

of the m features has spherical symmetry, Gaussian being the most prominent case. Oth¬ 
erwise, the more uniform the distribution of standard vectors \/n — 1 Xj/||xj|| is on the 


{n — 2)-sphere, the faster the sequence (62)-(64) converges. See examples of convergence 
in Supplementary Movies S1-S3. 
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S5 Identification of no-variation genes 


Let us consider a gene expression dataset, with g genes and c experimental conditions. 
Each condition k has Sk samples. The total number of samples is s = 
assume that c > 2 and that Sk > 2, for all conditions k G {1,..., c}. Let us also assume 
that, among the g genes, there is a fraction tto of non-differentially expressed genes (non- 
DEGs), with 0 < TTo < 1, while the remaining fraction 1 — vto comprises the differentially 
expressed genes (DEGs) (Storey and Tibshirani 2003 . 


Let us consider the usual ANOVA test comparing average expression levels across con¬ 
ditions, gene-by-gene. Under the null hypothesis of a non-DEG, the corresponding F- 
statistic follows the F-distribution with c — 1 and s — c degrees of freedom. The test of 
this hypothesis yields a p-value pj for each gene j G {1,..., g}. The obtained p-values pj 
follow a probability distribution that can be considered as the mixture of two probability 


distributions, Fq and Fi, for the non-DEGs and the DEGs, respectively Storey, 2003 


The fraction tto of non-DEGs follows the uniform distribution on the interval [0,1], 

Fo(p) = P, 


( 73 ) 


while the fraction 1 — tto of DEGs follows a distribution that verifies, for any p G (0,1), 


Flip) > P, 


(74) 


and the mixture distribution is 


F(p) = iPiFriip) + (1 - !ro)Fi(p). 


( 76 ) 


Let us further assume that there exists a p*, with 0 < p* < 1, such that Fi{p) = 1 for 

every p > p*. In other words, all DEGs have p-value pj from the ANOVA test such that 

Pj < P*, while only some genes among the non-DEGs have p-value with pj > p*. This 
implies that the mixture distribution of p-values is uniform on the interval [p*, 1], 

F(p) = TTop-Fl-TTo, forp*<p<l, (76) 

/(p) = TTo, for p* < p < 1. (77) 
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On the other hand, for any set of n samples a;(i) < X( 2 ) < • • • < X(n) obtained from n 
independent and identically distributed uniform random variables on the interval [a, b], all 
the distances between consecutive ordered samples (including boundaries), X(i) — a, X( 2 ) — 



realized that, for any j such that 2 < j < n — 1, the two subsets of samples a;(i),..., 
and a;(j+i),... ,X(^n) follow uniform distributions on the intervals [ci,a;(j)] and [x(^j),b], re¬ 
spectively. 

Based on these facts, to identify no-variation genes we propose hnding the minimum 
from the ordered sequence of p-values p(i) < p( 2 ) < ■ ■ ■ < P(g), such that a goodness-of-£t 
test for the uniform distribution on the interval [pq), 1], performed on P(j+i), • • • ,P(g), is 
not rejected. As a result, the genes corresponding to the p-values p(p,p(j+i),... ,p(g) are 
considered as no-variation genes. 

Given the concavity of F(p), the goodness-of-ht test used is the one-sided Kolmogorov- 
Smirnov test on positive deviations of the empirical distribution function. 

See Supplementary Movies S4-S6 for examples of this approach to identifying no-variation 
genes. 
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